Principle of Data Science with R , Hands-On Experience. | Abas M. | Skillshare


Principle of Data Science with R, Hands-On Experience.

Abas M.

Watch this class and thousands more

Get unlimited access to every class
Taught by industry leaders & working professionals
Topics include illustration, design, photography, and more


Lessons in This Class

33 Lessons (3h 29m)
    • 1. Introduction

      2:18
    • 2. Introducing RStudio

      3:32
    • 3. Objects

      7:14
    • 4. Logical Operators

      6:27
    • 5. Lists

      10:54
    • 6. Matrix

      6:40
    • 7. Dataframes

      8:50
    • 8. Reading & Saving Data Frames

      6:39
    • 9. For loop

      6:01
    • 10. If else Statements

      6:00
    • 11. Functions (Part 1)

      5:10
    • 12. Functions (Part 2)

      8:51
    • 13. Detecting NA's

      5:21
    • 14. Tidy datasets

      10:56
    • 15. Welcome to Data Wrangling

      1:27
    • 16. Installing Packages

      5:17
    • 17. Data Subsetting filter function

      9:15
    • 18. Data Subsetting arrange function

      2:04
    • 19. Dplyr select & rename Functions

      4:02
    • 20. Dplyr summarize, group_by & mutate

      5:51
    • 21. Statistical Learning

      5:43
    • 22. Why Estimate f

      9:26
    • 23. Installing Packages for LinearRegression

      0:49
    • 24. Simple LinearRegression

      7:53
    • 25. Hypothesis Testing

      11:27
    • 26. Evaluating Metrics for Linear Model

      9:18
    • 27. Correlation Plots

      5:21
    • 28. Multicollinearity

      2:38
    • 29. Train & Test Split

      6:33
    • 30. Simple Linear Model

      8:50
    • 31. Multiple Linear Regression

      4:30
    • 32. Diagnostic Plots

      9:49
    • 33. Scatter Plot

      3:38

Community Generated

The level is determined by a majority opinion of students who have reviewed this class. The teacher's recommendation is shown until at least 5 student responses are collected.

7 Students

-- Project

About This Class

This class teaches you the fundamentals of data science. Data science is a rapidly growing field that requires us to keep up with trends in technology. That said, I am eager to teach you the current knowledge needed to become a better data scientist. This continuous-learning mindset helps you develop and improve your skills so that you can perform effectively, gain new perspectives, and adapt to changes in the workplace.

This comprehensive course includes 33 lectures, each presented in plain English and avoiding confusing mathematical notation.

What you Will learn

In this class you will learn the fundamentals of data science. Unlike most data science courses, this class does not try to be exhaustive; instead, you will learn just enough material to be able to manipulate and analyze data as soon as possible. This class covers the following concepts:

  • Introduction to R
  • Data manipulation
  • Statistical tools
  • Simple linear Regression
  • Multiple linear Regression
  • Correlation plots
  • Diagnostic  plots
  • Model evaluation
  • Predictive Modeling

This class teaches you the tools needed to analyze data. You will learn how to build a simple linear regression model, as well as the different metrics used to evaluate it.

Prerequisite 

This class has no prerequisites and is designed for anyone with no prior experience. All you need is to dedicate some time to learning and bring a little bit of motivation.

In this class you are going to learn some of the key concepts of data science. The goal is to equip you with the essential skills required to become a full data scientist. So stay tuned.

Meet Your Teacher


Abas M.

Teacher

Class Ratings

Expectations Met?
  • Exceeded!
    0%
  • Yes
    0%
  • Somewhat
    0%
  • Not really
    0%
Reviews Archive

In October 2018, we updated our review system to improve the way we collect feedback. Below are the reviews written before that update.


Transcripts

1. Introduction: Hi, my name is Abas and I'm a data scientist with an interest in data-driven science. I have extensive experience in the planning and development of data products, and I apply a variety of skills through the practice of current technologies. I understand that the power of data science comes from a deep understanding of its essential concepts and algorithms, and that's the starting point of this class. So why data science? Data science is a rapidly growing field that requires us to keep up with trends in technology. Having said that, I'm eager to teach you the current knowledge needed to become a better data scientist. This continuous-learning concept helps you develop and improve your skills in order to perform effectively, gain new perspectives, and adapt to changes in the workplace. Nowadays we have a massive amount of data about many aspects of our lives and, simultaneously, an abundance of inexpensive computing power. It's not just Internet data: it's finance, the medical industry, pharmaceuticals, bioinformatics, social welfare, education, retail; the list goes on. There's a growing influence of data in most sectors and most industries, so we can say that it is the right time to pursue a new career and learn new skills. In this class, you are going to learn some of the key concepts of data science. In the first section, you will learn the basics of programming in R. R is an excellent programming language and free software environment for statistical computing and graphics. In the second section, we will dive into some of the principles of data manipulation. And in the last section, we'll look at how we build models from data. Generally speaking, this class will focus only on the essential steps in data science, so you can gain a solid understanding of the key concepts as soon as possible. Thank you. 2. Introducing RStudio: Hi, welcome to the first section.
Getting started. In this tutorial, I'm going to show you the basic features of RStudio. I want you to familiarize yourself with the RStudio environment, so that it will be very easy for you to perform data manipulation in the coming tutorials. I'm going to teach you the skills needed to master this comprehensive software and apply it effectively to a given dataset. As you can see, there are four different panels, and each panel does a specific task: the source panel, the console panel, the command history panel, and the viewer panel. The source panel is where you write your code. Any time you write code inside the source panel, you have to run it, and the result will be shown in the console panel. Take a very simple example, one plus one: write it in the source panel, run it, and the answer appears in the console. The console panel is like the heart of the program: it takes any code you have written inside the source panel, computes it, and gives back the result. You can also type directly into the console, say one plus one again, hit Enter, and you'll see the result. But that's not recommended for longer code: it's always preferable to write your code in the source panel, because then you have the possibility of saving the file and reusing it another time. In the command history panel, you will find all the commands you have typed into the console or run from the source panel; so far we have written one plus one, and you can see it in the history. In the viewer panel, you can check file names, and you can also check the plots and visualizations that you make under the Plots tab.
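As a sketch of the console workflow and the plot viewer described above (`Nile` is a dataset built into base R):

```r
# Typing an expression in the console (or running it from the
# source panel) prints its result in the console.
1 + 1   # evaluates to 2

# A quick visualization: a histogram of the built-in Nile dataset
# (annual flow of the river Nile, 1871-1970). The plot appears in
# the Plots tab of the viewer panel.
hist(Nile, main = "Annual flow of the river Nile")
```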
For instance, let's make a very simple visualization using the hist function with the Nile dataset that is built into R. As you can see, the graph appears: any visualization you make is presented in the Plots tab. Under Packages, you can see and choose which packages you want to use. And here we have the Help menu for R functions. Functions are self-contained modules of code that accomplish a specific task; they usually take in data, process it, and return a result. In the Help menu, you type the function you want to know more about, and you will find the description of that function. Voila! I hope you understand the different panels, and I look forward to seeing you in the next tutorial. Thank you. 3. Objects: Hi, welcome back. I am your instructor, and I'm glad that you chose this course. In this tutorial, I'm going to talk about the different data types used in R, and, as the title suggests, we will focus on R objects: variables, vectors, and characters. It is very important to know and understand the various data types and data structures that exist in R. So let's find out, first of all, what variables are. In any programming language, we usually need variables to store information, and the information we deal with could be a data point, a number, or related content. In other words, variables provide a means of accessing the data stored in memory. In R, we create variables using the assignment operator. I'm going to show you this code practically in a minute. Consider for the moment the two variables we have here, v1 and v2. In R, we create variables using the assignment operator, and the numeric value you see here is the data stored in the variable v1.
Any time you want to create a variable, you need to give the variable a name. So basically what we have here is an object: understand that an object is anything that you store in a variable. It could be a number or a character. In this case, we have a numeric object: we have assigned a numeric value to v1, and v1 is the name of the variable. In the second variable, we have a character value stored inside v2. All of them are objects, but they have different types, different classes: one is an object of class numeric, while the other is an object of class character. There are different classes of objects in R, but the most basic object in R is called a vector. A vector is the basic data structure in R, and it is a sequence of data elements of the same type. For instance, we have here two vectors. The first contains four numeric values, 2, 3, 5, 4: the data type is numeric, and the elements of the vector are all of the same type. In the second variable, we have a vector of type character: it contains the names "teachers", "students", and "school", which are no longer numeric values as in the first variable. So vectors are a data structure in R, and their elements always have the same type: all character or all numeric. There are, of course, different types of vectors that we will see in the coming tutorials, but for the moment, just remember that these are two different types of vectors. Now let's do some practice: we'll create a very simple variable inside the console panel.
You also have the choice of writing the code in the source panel, but for a quick demonstration, we'll write a very simple variable in the console. Suppose we want to create a variable called x1. You always give the variable a name, followed by the assignment operator; any time you want to create a variable, it is essential to write the assignment operator. Let's create a very simple object, or variable, of type numeric, and suppose we assign it the numeric value 123. If you hit Enter, the variable is created and saved in memory. If you want to recall the variable you created, just type its name, in this case x1, then hit Enter, and you will see its value. As I said, anything you save in the console, any data you create, is an object, but objects come in different types. To find out the exact type of an object, you use a very important function called class, followed by the name of the variable whose type you want to know. In this case that's obviously x1; hit Enter, and you see the type of the object, in this case numeric, which means we created an object of type numeric inside the variable x1. If you want to create a character variable, use double quotation marks and write the value you want to store, in this case, say, "home". A character value is just a piece of text; it could be anything. The quotation marks tell the program that you want to save the value as a character. To check, use the class function as before, with the name of the variable. To create a vector, you need to use the c function.
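The console session above can be condensed into a short sketch (x1 and the value "home" are just the example names used in the lecture):

```r
# Create a numeric variable with the assignment operator
x1 <- 123
x1           # typing the name prints the value: 123

# class() reports the type (class) of an object
class(x1)    # "numeric"

# Quotation marks create a character object
h <- "home"
class(h)     # "character"
```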
Remember, a vector is the basic data structure in R, and it's just a sequence of data elements of the same type. There are different types of vectors. Suppose you want to create a character vector: whenever you create a character value, you have to use quotation marks, either double or single. And do not forget, when creating a vector, to use commas to separate the different values, because a vector is a sequence of data elements. Let's say "b" is the second character value we want to put inside the vector. Hit Enter, and you've created a vector. I hope you found this tutorial useful. In the coming tutorial, we'll focus on the remaining data types. Thank you. 4. Logical Operators: Hi. In this tutorial, we're going to talk about logical operators. Logical operators are used to compare objects and to perform logical operations on values. Operators are just symbols that tell the program to perform a mathematical or logical manipulation. As you see here, we have different kinds of operators, and we use them when we want to compare variables: less than, less than or equal, and greater than. Suppose you want to compare two variables, v1 and v2: if you want to know whether v1 is less than v2, we use the less-than operator sign. We also have exactly-equal-to: if you want to check whether two variables are strictly equal, we use that operator. And we have the not-equal operator: if you want to find out whether v1 is strictly not equal to v2, we use it. Suppose we want to select variables in a dataset.
If you want to exclude some of them, you select the variable you want to leave behind and use the NOT operator, which tells the program NOT x, where x is the name of the variable. We also have the OR operator: for example, when you want to select either x or y. And we have the AND operator, x and y together: suppose you want to select two variables from a dataset, we use this logical operator. The other thing I want to mention is that any time we compare two objects, the program decides whether the comparison is TRUE or FALSE. For instance, suppose we compare two variables, x and y, where the value 5 is assigned to x and 6 is assigned to y. Then x less than y is saying that 5 is less than 6. Is that true? It's true, so the program returns TRUE. But if we ask whether x is greater than y, that's obviously false, and the program returns FALSE. These are logical values. Let's see these logical operations in action, and practice some of the logical operators I've just mentioned. We create a variable x, say x is 5, and another variable called y, which is 6. If you want to run this code, just highlight it and run; both variables are now in the console. To find out whether x is less than y, the program returns the result, in this case obviously TRUE, because x is less than y: x represents 5 and y represents 6. Now let's ask whether x is greater than y; highlight, run the code, and you see FALSE, because x is not greater than y, so the program returns FALSE, which is a logical value. Let's take another logical operator: if we say x is not equal to y and run the code, we get TRUE; of course, x is not equal to y. Now let's look at a vector.
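A sketch of the comparisons walked through above:

```r
x <- 5
y <- 6

x < y    # TRUE:  5 is less than 6
x > y    # FALSE: 5 is not greater than 6
x != y   # TRUE:  5 and 6 are not equal
x == y   # FALSE: they are not exactly equal
```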
So basically this is a vector of type character: we have two characters inside the vector, also called strings. A string is anything inside quotation marks; any character inside quotation marks is also called a string. We create another variable, and now we compare these two variables using the logical operators. Let's find out whether they are equal or not: highlight all of it, then run. As you see here, we get FALSE, FALSE, which means these two variables are not equal: the first variable has "b" where the second variable has "c", so they are not equal. Now let's try a numeric vector that contains 10 numbers. Highlight the variable (you always need to highlight to run the code) and hit Run. We have 10 different numbers, 1 to 10. But what if we want to select only the numbers that are greater than 4? To do that, write the name of the variable, then square brackets, and inside them the condition: the name of the variable, followed by greater than 4. Run it, and you see the numbers that are greater than 4. This will be very handy when we want to select from a large dataset in the coming tutorials. To find the numbers that are less than 4, use the same code but change the greater-than sign to less than 4, and hit Run. As you see here, we get three numbers, 1, 2, 3, which are obviously less than 4. This was just the basics, and I hope you now have an idea of how logical operators are used in R. In the coming tutorials, we will dive into some more complex logical operators. 5. Lists: Hi again. In this tutorial, I'm going to talk about how to create lists and how to work with them.
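Before moving on to lists, the vector comparison and subsetting from the logical-operators lecture can be sketched as follows (the specific letters are assumed example values):

```r
# Element-wise comparison of two character vectors
v1 <- c("a", "b")
v2 <- c("b", "c")
v1 == v2        # FALSE FALSE: the vectors differ position by position

# Subsetting a numeric vector with a logical condition
b <- 1:10       # the numbers 1 to 10
b[b > 4]        # keep only values greater than 4: 5 6 7 8 9 10
b[b < 4]        # keep only values less than 4: 1 2 3
```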
Lists are a special type of vector that contains elements of different classes. In contrast to a vector, in which all elements must be of the same class, lists can combine objects of different types. For instance, we have here a list object that contains different classes: there is a vector, a logical value, a string (or character), a number, and a complex number. So basically, lists are objects that contain elements of different types: a string, a vector, a logical value. Technically, a list is a kind of vector, but an ordinary vector contains a sequence of data elements of the same type: this vector contains numeric objects, and the data elements inside it are all of the same type. In a list object, by contrast, we have elements of different types: there are different objects in the list, and that is the difference between a list and a vector. Now, let's do some practice to understand more about list objects. To create a list, use the list function. Let's create a list object that contains different objects: a number, a logical value, and a string. As you see here, we have different objects: a character (or string), another string, a numeric vector, a logical value, and a single number. Let's run the code. As you see, we have different objects: the first one is obviously a string or character, the second is also a string, the third is a vector, in this case a numeric vector, and the fourth is FALSE. This FALSE is a logical value; you saw the different logical values in the previous tutorial.
These are the different kinds of objects that can be contained within a list object. So basically a list contains elements of different classes; as you see here, there are different classes, or objects, within the list structure. Now let's do list indexing, which means extracting the objects that are inside the list structure. To index a list, first write the name of the variable, followed by brackets, then write the position of the object you want. Suppose we want the second object within this structure: write 2, which means second position, then highlight and run the code, and you get the object in the second position. Let's do the same for, say, the fourth position: write 4, highlight, run the code, and we get FALSE. In this case it's the logical value, which is at the fourth position: if you count, the first position, the second, the third position is a vector, and the fourth, as you see here, is FALSE, which is obviously the logical value. To find the type of the variable, use the class function with the name of the variable; highlight, run the code, and you see we have a list object, since we obviously created a list. To find the length, or let's say how many objects are inside the list structure, we use the length function, again with the name of the variable; highlight, run, and you see 5, and indeed we have five objects within the list variable we created: 1, 2, 3, 4, 5. Now let's create a named list, a list that has different tags. Let's call it var2 and use the list function. Basically, what you see here is a list object that contains different classes. This is the name of a person we take as an example, and here is the tag name; the tag name describes the object. We also have "USA", which in this case is the country.
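The unnamed-list walkthrough above, sketched (the specific element values are assumptions standing in for the lecture's slide):

```r
# A list can mix classes: strings, a numeric vector, a logical value,
# and a complex number
my_list <- list("data", "science", c(1, 2, 3), FALSE, 7i)

my_list[[2]]      # second element: "science"
my_list[[4]]      # fourth element: FALSE
class(my_list)    # "list"
length(my_list)   # 5 elements
```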
"country" is the tag name that describes the object. And we also have 28, which is obviously a numeric object, with the tag name "age" describing the object. Now let's do list indexing using this variable. Remember, before we used square brackets to index a list, but in this case we have tag names: name, country, age. Let's first use the dollar sign. Suppose we want to extract the name object: highlight, run the code, and "John" is the value of the first object. We can also use square brackets: remember, any time you want to select, or index, a list, you write the position of the object you want to select. In this case, "John" is the first object, so we write 1, highlight, run, and you see "John". These are the different ways to index a list: you can use the dollar sign, which is often the preferred form, or you can use square brackets as well. Now let's find the second object, in this case the country. I think you understand the ways to index a list, using either the dollar sign or the square brackets: here we have "USA", which is obviously the second object, under the tag name country. To find the age as well, use the variable, the dollar sign, and age; highlight, run, and you see 28 is the age, a numeric object. These are the different ways to select from, or index, a list. Now suppose we want to add another object to the list structure. To do so, we use the name of the variable, then the dollar sign, and we write the name of the object we want to create, say occupation, and then we assign a value. In this case, occupation is obviously a character object, as the name suggests.
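The named-list example from the lecture, sketched (John, USA, 28, and MD are the lecture's sample values):

```r
var2 <- list(name = "John", country = "USA", age = 28)

var2$name       # "John"  (tag-name indexing with the dollar sign)
var2[[1]]       # "John"  (position indexing)
var2$country    # "USA"
var2$age        # 28

# Add a new tagged element to the list
var2$occupation <- "MD"
str(var2)       # describes the class of each element in the list
```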
Run the code, and to see the new object we just created, write the name of the variable, in this case var2, and hit Enter. You see occupation: occupation has been added to the list object, the variable we have here. These are the ways to index a list, as well as to add a new object. Now suppose we want to see, using the method we used before, the value of the occupation: take var2, use the dollar sign as before, and occupation appears among the tag names. Run this code, and you see that "MD" is the occupation of John in this list. Now, suppose we want to know exactly the types of the objects we have inside var2. To find out, use str, which is obviously a very important function: it tells us the different kinds of objects that are inside a variable. Use it on var2 and run. You see we have a list of four objects: name is a character, the second one is also a character (as you see, "USA" is a character object), this one is a numeric object, and the last one, "MD", is a character object. So str returns the kinds of objects and gives a description, so you know exactly the different types of objects inside a variable. I think you understand, and I hope you found this tutorial useful; I look forward to seeing you in the next tutorial. 6. Matrix: Hi again. In this tutorial, we are going to learn what a matrix is, and we're going to create some. A matrix is a two-dimensional array where each element has the same class. For instance, we have here a matrix. Basically a matrix contains rows and columns: it is a two-dimensional array where each element has the same type. We could say that a matrix is an extension of a vector, since a vector is one-dimensional data containing a sequence of elements of the same type.
In a matrix, we have a two-dimensional array, but the data elements are still all of the same type. The difference between a vector and a matrix is that a vector contains data of the same type but remains a one-dimensional array, while in a matrix the data is arranged in rows and columns; in this case we have five rows and four columns. In other words, a matrix is a set of numbers arranged in rows and columns, with the data laid out across the matrix. Generally speaking, matrices are used for mathematical computation, especially in the area of machine learning. Now let's practice and understand more about matrices. Let's create our first matrix. A matrix can be created in R using the matrix function. We create a numeric vector that contains 10 numbers, and then we specify the number of columns and the number of rows, because a matrix is two-dimensional data. If you want to run this code, you can use Ctrl plus Enter. We have created a matrix: the matrix here contains 10 numbers, and we specified, as you see, the number of columns and the number of rows; in this case we have five columns and two rows. In this matrix, the data is filled in column-major order: as you see here, the numbers are arranged down the columns, 1, 2, 3, 4. Let's say we want to change the order and fill the data in row-major order. To do this, we just use an additional argument called byrow, set to the logical value TRUE to indicate row-major order. Run this code, and then print the variable again.
I think you can see the difference between the first variable and the second. In the first, the data is filled in column-major order, as you see here: 1, 2. In the second, we just changed the order by setting this argument to TRUE, and now the data is filled in row-major order. That's basically the difference between the two variables, using just this one argument. To check whether a variable is a matrix or not, use is.matrix: it tells us whether the variable is a matrix and returns a logical value. Write the name of the variable and run the code; TRUE indicates that the variable is a matrix. A matrix can also be created by binding two variables. Let's create a new variable called x, which is just a numeric vector, and another numeric vector called x1. So we have two vectors, or two variables, and we want to create a matrix by merging, or binding, them. There are two ways to do this: column binding or row binding. Let's first create a matrix by column binding. In the case of column binding, cbind is used to merge two variables and create a matrix along the column dimension: we write cbind with the two variables we want to combine, in this case x and x1. If you want to save the result of this function, you can assign it to a variable, say result, and then run the code. As you see, the two variables are combined, or merged, together, and we have x and x1 as columns. In the case of row binding, we use rbind: write rbind with the two variables we want to combine, x and x1, and run the code. These are the differences between rbind and cbind: in the case of rbind, the variables are merged together as rows.
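A sketch of the matrix operations above (the particular numbers are assumed example values):

```r
# 10 numbers arranged into 2 rows x 5 columns, column-major by default
m1 <- matrix(1:10, nrow = 2, ncol = 5)

# byrow = TRUE fills the same numbers in row-major order instead
m2 <- matrix(1:10, nrow = 2, ncol = 5, byrow = TRUE)

is.matrix(m1)   # TRUE

# Building a matrix by binding two vectors
x  <- 1:3
x1 <- 4:6
cbind(x, x1)    # vectors become columns: a 3 x 2 matrix
rbind(x, x1)    # vectors become rows:    a 2 x 3 matrix
```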
As you see here, we have two rows, and each row is one of the merged variables. But in the case of cbind, which actually means column binding, we have two columns, and each column represents a variable: as you see here, we have x and x1. Well, I hope you understand what a matrix is and how to create one. In the coming tutorial, we'll be focusing on DataFrames. 7. DATAFRAMES: Hi again. In the previous tutorial, we learned how to create a matrix. In this tutorial, we will be focusing on DataFrames, which are similar to a matrix but slightly different. So what is a DataFrame? A DataFrame is like a matrix, with a two-dimensional rows-and-columns structure. However, it differs from a matrix in that each column might have a different type. For instance, this is a DataFrame, and it consists of different data types, such as numbers and characters. There are two character vectors here, names and occupation, and here is a numeric vector. So basically, a DataFrame contains different types of objects, and it is also a two-dimensional array: we have rows and columns. In this case there are four rows and three columns. In contrast, a matrix must consist of data elements of the same type. As you see here, this matrix also contains a two-dimensional data array, but the data inside the matrix must all be of the same type. So basically, that is the difference. We could say that a DataFrame is an extension of a matrix, since a DataFrame also contains a two-dimensional data array, but it differs from a matrix in that each column may have a different type. So why do we need DataFrames? DataFrames are used to store tabular data in R. They are an important type of object in R and are used in a variety of statistical modelling applications. Remember that a DataFrame contains a collection of vectors. In this case, there are three different vectors.
The first two variables, names and occupation, are character vectors, and this one is a numeric vector; as you see here, they are all numbers. We could also say integers: integers are numbers that do not contain a decimal value. Well, I will present quite a few DataFrame examples, so you'll become familiar with them. Let's create our first DataFrame. Remember, a DataFrame contains a collection of vectors, so in this case we create three different vectors: one numeric vector and two character vectors. Start with the first variable and create a numeric vector, then create a character vector, and then the last one. Well, these are different vectors: the first one is a numeric vector, and the second and third are character vectors. Now we create a DataFrame from these vectors: we include the vectors in the data.frame function and give a name to the DataFrame. Basically, this code creates a DataFrame. Here you see stringsAsFactors, which you can set to the logical value FALSE or TRUE. In case it's FALSE, it tells the program not to convert strings into factors. So if stringsAsFactors is set to FALSE, R will not convert the character vectors you see here, occupation and names, into factors. If this argument is not specified, then in versions of R before 4.0 the default was TRUE, which means R would convert all character vectors into factors. We do not want that to happen; we need to keep our character vectors, so we set it to FALSE, and that way R will not convert the character vectors. Well, now that we have a DataFrame, let's visualize it, and you will see the tabular data in the source panel. Use the View function, then write the name of the variable, in this case var. Well, this is the tabular data; it's the DataFrame we just created.
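A minimal sketch of building this DataFrame; the names, jobs, and ages below are made up, since the exact values aren't given in the text:

```r
# One numeric vector and two character vectors
age        <- c(25, 31, 42, 28)
person     <- c("Anna", "Ben", "Carla", "Dan")
occupation <- c("nurse", "teacher", "engineer", "chef")

# Combine the vectors into a data frame; stringsAsFactors = FALSE
# keeps the character vectors as characters instead of factors
var <- data.frame(names = person, occupation = occupation, age = age,
                  stringsAsFactors = FALSE)

View(var)   # opens the data frame as a table in RStudio's source panel
```

Note that since R 4.0 the default for stringsAsFactors is already FALSE, so writing it out is only needed on older versions, but it does no harm.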
Suppose now we want to access the first column. I think you remember list indexing from the previous tutorial. There are different ways to access the data inside this variable, var. Use square brackets and then select the position you want to access. In this case, let's say we want to access the first column; the first column is at the first position, so we write 1 and run this code. Here you see the first variable, or the first column, of the DataFrame. You can also access and select the first column using var[, 1]; the only difference is that in this case the data comes back as a plain vector. Well, now let's select the same column, but this time using the dollar sign. These are the different ways to select, or access, the variables inside a DataFrame. Suppose now we want to rename one of the variables inside the DataFrame. The variable we want to rename in this case is occupation. To rename, or change, the name of a variable or column, use the colnames function. Write the function, then the variable, then square brackets. Inside the square brackets, you have to write the position of the variable that you want to rename; occupation is at the second position, so we write 2, and then we assign the name that we want to use, in this case, let's say jobs. Don't forget to write the double quotation marks any time you are creating a character, or string. Let's see if the name has changed. Well, as you see here, the name has changed: we had occupation before, and now it's jobs. That is basically how to change column names using this function. Suppose now we want to delete a variable; let's say we want to delete age. To do so, use this code: first write the variable, then assign to it var, which is obviously the DataFrame that we created and stored in this variable called var, then square brackets, followed by a comma.
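Assuming the var data frame built earlier in the lesson, the indexing, renaming, and deleting steps can be sketched as:

```r
var[1]        # first column, returned as a one-column data frame
var[, 1]      # first column, returned as a plain vector
var$names     # the same column, selected with the dollar sign

# Rename the second column (occupation) to "jobs"
colnames(var)[2] <- "jobs"

# Delete the third column (age) with a negative index
var <- var[, -3]
```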
Then we write a minus sign, the subtraction operator. In this case, the minus indicates that we want to subtract, or delete, a variable. Age is at the third position, so we write 3, indicating the position of the variable that we want to delete. All right, let's run this code. Well, the variable has been deleted: as you see here, there is no age variable inside the dataset, or our DataFrame. Well, I hope you understand how to create variables and DataFrames. I think you can now create your own DataFrame and manipulate it a little bit: try to change some rows, names, or columns, and also subtract, or delete, some columns. In the next tutorial, we will dive into different ways of reading a dataset, as well as saving a dataset. 8. Reading & Saving Data frame: Hi. In the previous tutorial, we created a small DataFrame. In this lesson, I'm going to show you how to save a dataset. Well, suppose we want to save the dataset that we created in the previous tutorial. There are different ways to save a file or data, and it depends on the format that we want. In this tutorial, I'm going to cover the most commonly used functions for reading and saving a dataset, or a DataFrame. Well, I first encourage you to create a new folder in your working directory so that you can save the files and datasets used during this course. To create a folder, go to New Folder and write the name of the folder that you want to create, let's say in this case course files, and click OK. Well, as you see here, the course files folder is created, and this is a new folder. Anything that you create or want to save will be saved in this folder. You can also make this folder that you created your working directory: go to More and then click Set As Working Directory.
And now this course files folder is our working directory, which means that everything you save is going to be inside this folder. Well, now let's look at our first option, which is to save your data as an R object. In this case, we have a DataFrame called var, the one we created in the previous tutorial, and we are going to save this var as an R object. Well, let's save the file. Use the saveRDS function to save the variable as an R object. Within this function, write var; var is the name of the dataset that we want to save. Then write file =, use double quotation marks, and write the name under which you want to save it, let's say data1; the extension must be .rds. Run this code, and you see here the data is saved in our working directory. Suppose now we want to read the data that we just saved in our working directory. You can read a DataFrame back using the readRDS function: write file =, because file indicates the name of the data, and then the name of the file that you want to recall, in this case data1, with the extension, in this case .rds. You can also save the result of reading the dataset in a variable, let's say var2. Run this code. Well, you see names and jobs, the two character vectors; this is actually the data inside this variable. Well, now let's see the second option, which is to save a dataset as a CSV file, which stands for comma-separated values. CSV is a standard way to save a dataset, and it can also be opened with Microsoft Excel. So let's save our data in this case as a CSV file. Well, basically this code saves the DataFrame in our working directory. Run this code, and you'll see that a new file is saved in the directory. Now you see here: this is the second dataset that we saved in our working directory.
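The save-and-read round trips described in this lesson can be sketched as follows; the file names follow the lesson, and row.names = FALSE is an extra option I've added so the CSV doesn't gain a row-number column:

```r
# Save the data frame as an R object, then read it back
saveRDS(var, file = "data1.rds")
var2 <- readRDS(file = "data1.rds")

# Save the data frame as a CSV file, then read it back;
# stringsAsFactors = FALSE keeps character columns as characters
write.csv(var, file = "data2.csv", row.names = FALSE)
var3 <- read.csv(file = "data2.csv", stringsAsFactors = FALSE)
```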
Suppose we want to access, or read, the CSV file in our working directory again, or any dataset in the form of a CSV file. Use the read.csv function and write the name of the file, or dataset, that you want to recall. Always use file =, because file indicates the name of the data, in this case data2, followed by the extension, in this case .csv. Well, it's very important to always set the stringsAsFactors argument to FALSE. I think you remember how that affects our dataset: if we set it to FALSE, it tells the program not to convert the character vectors into factors. You can also save the result in a variable so you can use that variable at any time; let's say in this case var3, and run the code. Well, this is the simplest way to read a CSV file from a directory. One last thing that I want to emphasize is that in case your data is extremely large, there are a few things that you need to consider. You need to make sure that the size of your data is not bigger than the amount of RAM on your computer, because R stores all objects in physical memory, which is the RAM of your system. For instance, if your dataset contains 1.5 million rows and 120 columns, that dataset would require roughly 1.34 gigabytes of RAM. Most computers these days have at least that much RAM; however, you need to be aware of it, and you should check as well whether there are other programs running on your computer that are using up much of the RAM. 9. For loop: Hi. In this tutorial, you will learn how to write a basic loop in R. Generally speaking, a loop, or for loop, is a way to repeat a sequence of instructions under certain conditions. So a loop is used to automate parts of your code that are in need of repetition. To understand more about loops, let's take the following example. Suppose we want to iterate, or repeat, over the elements of an object. Let's first create an object of type vector. Well, now we want to iterate over the elements of this vector.
So the length of this vector is six, meaning that there are six elements in this vector. So let's create a for loop and see how that works. This loop takes an iterator variable and assigns it successive values from the vector. The i that you see here is called the iterator variable, and this variable is repeated based on the length of the vector. When we run this code, we're going to see a numerical output, and the length of that numerical output will be based on the length of the variable, in this case x. So let's run the code and see. Well, you see here there are six numbers, and this is the length of the vector. The print that you see here is the code, and this code is inside the curly braces, or curly brackets. So print actually prints the i; i is the iterator variable, and the iterator variable is repeated, or iterated, based on the length of x. X is a numeric vector, and the length of this numeric vector is six. A for loop can also be created without declaring a variable: we can write a sequence of numbers directly inside the parentheses. Let's see that in a second. When the for loop is executed, the i will be printed, or iterated, based on the length of the sequence; in this case, there is a sequence of numbers from one to six. Well, you see, these are the numbers. The difference is just that this one uses constant numbers; the first one uses a vector. Well, we can also make some mathematical computation. Let's add plus three. You see, now three is being added to each number. We have different numbers now because we added three to each number: with this code, we added the iterator variable plus 3. Well, we can also create a for loop without curly braces, all in one line. Let's do that. As you see, this for loop does not contain any curly brackets; we just wrote all of it in one single line. It's a very simple piece of code: the print is the code that's going to be executed based on the condition within the parentheses of the for loop.
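The three loop variants described above can be sketched as follows (the vector's values are illustrative):

```r
x <- c(10, 20, 30, 40, 50, 60)   # a vector of length 6

# First loop: iterate over the elements of the vector
for (i in x) {
  print(i)
}

# Second loop: a sequence written directly inside the parentheses,
# with a small computation added to the iterator
for (i in 1:6) {
  print(i + 3)
}

# Third loop: no curly braces, all on one line
for (i in x) print(i)
```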
X is the variable that we declared the first time. Keep your code separated: use the hashtag to write a comment, let's say first loop, second loop, and so on. Well, a comment is a piece of text placed within a program that the program will not run; it is simply ignored. It helps to clarify and give some description. Let's now create a for loop with an if statement. Suppose we need to print all odd numbers between one and 20. Remember that an odd number is any number which is not divisible by two. Well, here is a for loop with an if statement. The i that you see here is the iterator variable; you can name it as you want. And the following code is the sequence of numbers from which we want to print the odd numbers. Then there's the if statement; within the if statement's parentheses, we have the conditional statement. If this conditional statement is true, then this code is executed, as you see here. Next is the code that's going to be executed based on the condition. This condition selects all the numbers that are even; even numbers are the numbers that are divisible by two. The double percent sign %% is the modulo operator: if you use it between an even number and two, then the result is always 0. So we do not want to select even numbers, and next actually skips all the numbers that are even. Last, print will print all the numbers that are not divisible by two. To select instead all the numbers that are divisible by two, also called even numbers, just write an exclamation mark in front of the condition. In this case, next, which is obviously the code that's going to be executed if the condition is true, will skip all the numbers that are not divisible by two. These are the numbers that are divisible by two, also called even numbers. I hope you understood the different ways to create for loops, and I look forward to seeing you in the coming tutorial. 10. If else Statements: Hi. In this tutorial, you will learn about conditional statements.
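Bridging into conditionals, the odd-and-even loops from the last lesson can be sketched as:

```r
# Print all odd numbers between 1 and 20
for (i in 1:20) {
  if (i %% 2 == 0) {
    next        # %% is the modulo operator: even numbers give 0, so skip them
  }
  print(i)
}

# Negate the condition with ! to print the even numbers instead
for (i in 1:20) {
  if (!(i %% 2 == 0)) {
    next        # now the odd numbers are skipped
  }
  print(i)
}
```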
A conditional statement, or conditional expression, is a programming construct where a decision is made to execute some code based on a condition. To understand more about if statements, or conditional expressions, let's take the following example. We create two variables, x with the value 500 and y with the value 100, and then we use logical operators within the statement, and the code will be executed based on the condition. Well, let's create a conditional statement. This is a conditional expression, or conditional statement, and the thing inside the parentheses is the conditional statement itself. And this is the code; this code will be executed based on the condition that you see here. So the code always goes inside the curly brackets, also known as curly braces. Well, this code can only be executed if the statement is true, in this case if x is greater than y. So if this statement is true, then this code will be executed. The code that you see here is within the print function; print prints the characters that you see here. So let's run this code and find out whether this code is executed or not. Well, the code is executed because the condition is true: the conditional statement, x is greater than y, is true because 500 is greater than 100. Well, now let's change the y variable and make it 1000. We run the statement again and see whether something has changed or not. Well, as you see, no code is being executed, because this code can only be executed if the condition is true. In this case, the condition is not true; it's false, because x is not greater than y: x is less than y. So since this condition is false, this code will not be executed. Well, in that case, there is another statement called the else statement, and it's used to run some code in case the first conditional statement is not true. Let's find out what that means.
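The if and else branches about to be run can be sketched as follows, with the values used in the lesson:

```r
x <- 500
y <- 1000

if (x > y) {
  print("x is greater than y")
} else {
  print("x is not greater than y")
}
# [1] "x is not greater than y"
```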
Now let's run these statements and see which code will be executed. Well, as you see, "x is not greater than y" is printed, and obviously that is true. This is the second code being executed, because the first statement, x is greater than y, is false; since it is not true, the first code is not executed, but the second code is. Remember that anything within the curly brackets, or curly braces, is the code; it could be any code that you wish to write inside. Inside the parentheses is always where you write the conditional statement. You can also use the ifelse function: in one function, you can express an if-else statement. Well, you always write the conditional statement first, followed by the code that's going to be executed. Let's write exactly the same conditional statement, x greater than y, and then we write the code, in this case a simple character string. Well, this statement is exactly the same; the only difference is the form of the function. If the conditional statement is true, then the first code will be executed; if it's not true, the second code will be executed. Let's check that out. As you see, now the second one is executed, the code in the second position; it could be any code that you write in that position. You can also create a multiple conditional statement. To create a multiple conditional statement, use else if. Go back to the first example; we just want to extend the conditional statement in this case. So write else if, then the conditional statement, let's try this time x equal to y, and then the curly brackets, or curly braces, to include the code. Well, you need to pay attention to how you write the curly brackets, or curly braces, as well as the parentheses; make sure the code is clean, and that way you can run it and find the result. Well, in this case, we include another condition, x is equal to y.
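A sketch of the one-line ifelse() form and of the extended else if chain described above:

```r
x <- 500
y <- 1000

# The same test as a single ifelse() call
ifelse(x > y, "x is greater than y", "x is not greater than y")

# A multiple conditional statement with else if
if (x > y) {
  print("x is greater than y")
} else if (x == y) {
  print("x is equal to y")
} else {
  print("neither condition is true")
}
```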
In this case, if x and y are equal, then the code will be executed. So run this code. Well, as you see here, the last code is executed, because the first if statement, x is greater than y, is not true: x is 500 and y is 1000. The second statement, x is equal to y, is also false, not true, because x is 500 and y is 1000, so they are not equal. In that case, the last code is executed, because neither the first condition nor the second one is true. Well, I hope you understood the different ways to create a conditional statement, and I look forward to seeing you in the next tutorial. 11. Functions (Part1): Hi. In this tutorial, you will learn how to write a function. A function is a group of instructions that takes inputs and returns a result. A function is a piece of code written to carry out a specified task, and it is the most essential programming construct in R. Well, R has many built-in functions which can be directly called in a program, but we can also create our own functions. To understand more about functions, let's create our first function. Functions are created using the function directive and then stored as R objects, just like anything else. Well, let's create a function. You have to give a name to the function that you want to create, something easy to recall; let's say our_first_function. Then, inside the parentheses, we write the arguments. An argument is like a placeholder for some values we want to pass to the function; we will see that in a second. Let's write a: in this case, a is an argument. Then come the curly braces. Within the curly braces we write the code, also called the body of the function; it is where you write the code, or instructions, that are going to be executed within the function. So in this case, let's say a divided by 2. Well, let's run this code. The function is created. Let's take an example to understand more about how this function works.
So this is a very simple function. Any time you want to use the function, you have to write the name of the function that you created, in this case our_first_function, and you'll see it appear. Then, within the parentheses, we write the argument. We already created an argument; in this case we have only one. So any argument we pass to the function will be divided by two: let's say 10, and then the result will be 10 divided by 2. So the argument that we are passing to the function in this case is 10. Well, now we create a function with two arguments. Remember that an argument is like a placeholder for some value that we want to pass to the function. Let's call it second_function, and then within the parentheses we write the names of the arguments; they can be any names that you want, in this case let's say b and c. Then, as usual, we write the curly braces, and within the curly braces we write the instructions, or code, that are actually going to be executed when the function is called. In this case, let's say b plus c divided by 2. Run this code. Well, the function is created. Now let's have a look at how that function works. We take a very simple example. Remember, this function takes two arguments, b and c; in this case, let's pass 8 and 9. Remember as well to write the comma to separate the two arguments. In this case, these two arguments will be passed to the function, and the function uses them to carry out this instruction, b plus c divided by 2. Run this code. As you see, the result is 12.5; this comes from the instruction b + c / 2, where, because of operator precedence, only c is divided by two, so it is 8 plus 9/2. You can also save the result in a variable and then use return to return the value: within return, write the variable in which you saved the result. Well, this function is the same; it is no different from what we had before.
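The two functions built in this lesson can be sketched as follows; note the operator precedence in the second body, which is what produces 12.5 rather than 8.5:

```r
# One argument: divide it by two
our_first_function <- function(a) {
  a / 2
}
our_first_function(10)   # 5

# Two arguments; c / 2 is evaluated before the addition
second_function <- function(b, c) {
  result <- b + c / 2
  return(result)
}
second_function(8, 9)    # 12.5
```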
In this one, we just write return to return the value, or the variable. So there's no difference. Let's run it again: you see 12.5. Well, a function can also be created without any argument; let's see that in a second. Well, there is no argument within this function; we just include the code, or instructions. And this function has a for loop as well. So if we call the function, by writing its name, what will be executed is the instruction inside the function, which in this case is the for loop. Within the for loop, we have the code, which is obviously the print, so the i, the iterator variable, will be printed, as you saw in the previous tutorial. Well, these are the numbers. This function does not contain an argument, so we don't need to pass an argument to the function. 12. Function Part2: Hi. In this tutorial, I'm going to talk about different kinds of functions, such as lazy evaluation, a function with a loop, a function within another function, and a function that checks whether a number is odd or even. So generally speaking, a function is used to carry out a certain, specified task, and functions are very useful. We have many different functions in R, but we can create our own functions as well. Let's start, first of all, by understanding a little bit about what lazy evaluation is. This function never actually uses the argument b, because we did not use the argument b inside the body of the function. So now, if we call the function, we only need to pass the first argument; because the second argument is never used, we didn't reference it within the body of the function. In that case, you only need to pass one argument to the function, and the function will be evaluated with that argument. So any number we write inside the function call will get positionally matched to the first argument, which is a, and the result will be that number divided by two. So let's see.
Now, if I write 5 within the function call, that means 5 gets positionally matched to the first argument, which is a. Because b is never used, the result will be 5 divided by 2; as you see, 2.5 is 5 divided by 2. But if I write the second argument, you will find the same result, because 5 gets positionally matched to a, and the function ignores the second argument; there is no instruction telling the function to evaluate the second argument within the body of the function. As you can see, the result is the same: nothing has changed, even though we specified 6 as the second argument when calling the function. The function never evaluates it, because the argument is not used within the body of the function. In contrast, if we write, or use, the argument b inside the function, let's say a plus b, and run this code again, then calling the function, let's say with 5, the same example, now gives an error telling us that the argument b is missing. This error occurs because when we were calling the function, we just mentioned the first argument; we didn't specify the second one. So in this case, 5 gets positionally matched to the first argument, which is a, and R tells us that the argument b is missing. In contrast, in the example you saw earlier with two arguments, we didn't get an error, because the function ignores the unused second argument. So that is how lazy evaluation works: lazy evaluation is when an argument is evaluated only as needed in the body of the function. Well, now let's see a function with a loop, or for loop. This function contains a for loop; the iterator is iterated based on the length of the sequence. So the sequence is 1 to n, and n is an argument that's going to be passed to the function.
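A sketch of lazy evaluation and of the loop function just introduced (the function names here are mine; the lesson's own names may differ):

```r
# b never appears in the body, so R never evaluates it (lazy evaluation)
lazy_fun <- function(a, b) {
  a / 2
}
lazy_fun(5)      # 2.5 -- works even though b is missing
lazy_fun(5, 6)   # 2.5 -- the second argument is simply never evaluated

# Once b is used in the body, omitting it becomes an error
strict_fun <- function(a, b) {
  a + b
}
# strict_fun(5)  # Error: argument "b" is missing, with no default

# A function containing a for loop over the sequence 1 to n
loop_fun <- function(n) {
  for (i in 1:n) {
    print(i)
  }
}
```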
So let's call the function. N is the argument of this function; let's say 6. Well, these are the numbers, using the argument 6. This argument is passed to the function, and the function evaluates it. Within the function, there is a for loop, and the for loop contains a print as well; we save the print inside result, and then we return result inside the function. Well, now let's see a function within another function. This function contains another function, and there are two arguments, x and y. X is the argument of the first, outer function, and y is the argument of the second, inner function. The instruction that you see here, x times y, is inside the body of the second, inner function. Since this function contains another function, the way we pass the arguments to it is a little bit different from the other functions. Let's see that in a second. First of all, we handle the first argument, x: we call the outer function with the value of the first argument, in this case 10, and save it in a variable, let's say result. Then, for the second argument, we call result with the value of y. You can also write it with double parentheses, calling both at once. Let's take 10 as the value of the second argument as well. And this is the result, 100, which is x times y: in this case the two arguments are 10 and 10, so it's 10 times 10, and the result obviously is 100. And this is how we pass the arguments to the function when there are nested functions, or a function within another function. Well, now let's create a very interesting, basic function: a function that will check whether a number is an odd number or an even number.
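The function-within-a-function example can be sketched as a closure, and the odd-or-even checker about to be built might look like this (function names assumed):

```r
# A function that returns another function
func1 <- function(x) {
  func2 <- function(y) {
    x * y
  }
  return(func2)
}

result <- func1(10)   # fix the first argument, x = 10
result(10)            # 100, i.e. x * y
func1(10)(10)         # the same call written with double parentheses

# Check whether a number is odd or even
odd_or_even <- function(n) {
  y <- n
  if (n %% 2 == 0) {
    paste(y, "is an even number")
  } else {
    paste(y, "is an odd number")
  }
}
odd_or_even(8)   # "8 is an even number"
odd_or_even(7)   # "7 is an odd number"
```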
Well, basically this function checks whether a number is an odd number or an even number. This function contains an if statement as well; the condition that you see here checks whether the number is an even number or not. In case this condition is true, the code that you see here is executed, which is based on the paste function. There's a variable y that we declared before the if statement; this y represents n, and n is the number that we pass to the function, which is actually its argument. In this code, we concatenate the variable y, which we declared before the if statement and in which we saved the argument, with a little commentary, or comment, describing whether the number is an odd number or an even number. So let's run this code. Now let's have a look at the function. We check whether 8 is an even number or an odd number: well, you see, 8 is an even number; an even number is a number that is divisible by two. Let's check again whether 7 is an even number or an odd number: 7 is an odd number. Well, I hope you found this tutorial useful, and I look forward to seeing you in the next tutorial. 13. Detecting NA's: Hi, welcome to data science. In this lesson, I'm going to talk about missing data. Missing data, or missing values, occur when no data value is stored for a variable in an observation. Missing data present various problems, because missing data reduces statistical power and can have a significant effect on the conclusions that can be drawn from the data. The problem of missing data is relatively common and can have a significant effect on the quality of the data analysis. For instance, this tabular data has some missing values. Missing values are denoted by NA, as you see here; in this case, NA indicates a missing value. So if you see these NA values in a dataset, remember that they are indicating the missingness of the data. It is difficult to assess the cost of missing data.
Even a small percentage of missing data can cause serious problems with your analysis and lead you to draw wrong conclusions. So missing data occur during data collection. For instance, in research, missing data may occur due to human error: for example, a researcher might forget to take a measurement, such as a patient's heart rate. Missing values can happen as well due to equipment failure, and databases too are prone to missing values. It is quite common to see data with some missing values. In this tutorial, you will learn how to detect missing values in a dataset, and in the coming tutorials, you will learn more about how to handle missing values. Well, let's now check for missing values in a dataset. We use a built-in dataset available in R. This dataset is about daily air quality measurements in New York from May to September 1973. You can read more about this data by using the help function. Well, now let's summarize this data using the summary function. Let's run this code. Well, this function summarizes the whole dataset. You see here the different variables in this dataset, and you can find out and see the number of missing values in each column. In this case, in the Ozone variable, there are 37 NAs, as you see here; this indicates the missing values in this particular variable. The second variable, Solar.R, has seven NAs. The other variables do not contain any missing values; we have only two variables with missing values. There are other things that you see here, such as max, min, and median; for the moment, just consider that using the summary function summarizes the whole data, and in that way you can check and understand how many NAs are in each column, or each variable. There are different functions for checking whether there are missing values in a dataset. The following code can be used to check if there are missing values. The sum function is going to add together the missing values that are present in the dataset. In this case, our data is called data.
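The checks described here can be sketched with the built-in airquality dataset:

```r
data <- airquality        # daily New York air quality, May-September 1973

summary(data)             # per-column summaries, including NA counts

sum(is.na(data))          # total number of missing values: 44

colSums(is.na(data))      # missing values per column:
                          # Ozone has 37 NAs, Solar.R has 7
```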
We saved the airquality data in this variable called data. Within the sum function, we have another function as well, is.na. This function checks whether there are NAs in this data. Run this code. Well, the total number of NAs in this data is 44, as you see here. So this is the simplest way to count all together the missing values that are present in this dataset, or in any given data. Let's say now we want to find out how many missing values there are in each column, or each variable, in this dataset. Well, use the following code to do it. In this code, we have a function that calculates the number of missing values present in each column of this dataset. As you see here, there are missing values in the first two variables; the other variables do not contain any missing values. Well, I think you now understand the different ways to check for missing values in a dataset. In the coming tutorials, you will learn more about how to handle missing values, and you will be equipped with the skills needed to deal with them. 14. Tidy datasets: Hi, in this lesson you will learn about tidy datasets. Tidy data is a consistent way to organize your data in R. Getting your data into this format requires some upfront work, but that work pays off in the long term. Well, a tidy dataset follows some rules. Generally speaking, there are three rules which make a dataset tidy. A tidy dataset is very important when you want to analyze the data: if the data is tidy, then the analysis is much easier and more flexible than with data that is not tidy. Well, there are a few rules which make a dataset tidy. The three rules are: each variable has its own column, each observation has its own row, and each value has its own cell; then the data is tidy. As you see in this example, this is a dataset. Each variable must have its own column, as you see in this case: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species.
These are the different variables in this dataset, and each variable has its own column: the first column, the second, third, fourth. So each variable must have its own column. And each observation — an observation is a row — must have its own row. This observation is the first observation, and each observation has its own row; you see the different rows. Then each value must have its own cell. These are the rules that a tidy dataset must follow. These are the values of the dataset, and each value has its own cell; as you see, these are the different values in the dataset. Well, tidy data is an essential topic to understand, because if your data is tidy, then the analysis is much easier than if the data is not tidy. Well, now let's have a look at these datasets. Which one do you think is a tidy dataset, and which is not? In this example, this one is a tidy dataset because each variable has its own column — in this case country, year, cases, population — and each observation has its own row; you see there are different rows. But this data is not tidy, because in the variable type there are two variables joined together, which are cases and population. So this dataset is no longer a tidy dataset. And this dataset as well is not tidy, because you see in the rate variable there are two variables joined together, which are cases and population, and you see the slash separates the two variables. Well, we're going to clean these datasets and make them tidy, because a tidy dataset must follow the rules we talked about: each variable must have its own column, each observation must have its own row, and each value must have its own cell, as you see in this case. But these datasets are not tidy, because there are two different variables in one single variable, type.
And then this one as well: the variables are joined together. As you see, these are the cases and this is the population, so cases and population are joined together in one single variable, or one single column. Well, let's have some practice and make it tidy. First of all, we load the package tidyverse. This package contains several built-in example datasets, which are very small datasets that let you understand what is a tidy dataset and what is not. Table1, shown previously, is a tidy dataset: each variable has its own column and each observation has its own row; you see it on the screen, and it is tidy. But the second table — you can find this dataset in this package — table2 is not tidy. This data is not tidy, as you see: in the variable type there are two different variables, cases and population. These are two variables, but they are joined together in this variable type. Well, you can see the difference between these two datasets. The first one is tidy data; as you see, each variable has its own column. But in the second, the type variable has two variables joined together, cases and population. To make it tidy, we use the spread function. This function can take several arguments. First of all, within this function, we write the data that we want to spread. Spread is used when — in this case — we want to spread these values, cases and population, so we are going to make two different variables, and in that way the data will be tidy. We write the name of the data, which is table2; then key and value are the two most important arguments that this function takes. For the key, we write the variable that we want to spread, which in this case is type, in double quotation marks. For the value, we write count. Count actually holds the different cases, or the different numbers, that each variable contains. As you see, now the data is tidy.
We have cases and population. You can see the difference between this one and the previous one: the previous one was not a tidy dataset, but now the data is tidy, on the screen. Each variable has its own column; they are no longer joined together in one single variable. Cases has its own column and population has its own column. Well, this is the way we clean, or make, the data tidy when two variables are joined together inside a single variable, using the spread function. Well, now I want to introduce you to a very powerful way to write exactly the same code, but slightly differently, using the pipe operator. So this is the pipe operator, %>%. It is very handy and a quick way to write simple code. As you see, in the spread function we included the dataset, or the name of the dataset, which was table2. But using the pipe, we don't need to include the name of the data inside the function; we just use the pipe operator. Before the pipe operator, we write the name of the dataset that we want to manipulate, or change, arrange, reshape — any kind of operation that we want. So in this case we write the name of the dataset, which is table2. The pipe operator is a quick way to write simple code. Now we can write the spread function, and the pipe operator connects the data and the function. Well, as you see, these two results are equal; both of them are now tidy. The difference is in the code we used: in the first code, the spread function has the data inside, but in this case we didn't write the data inside the spread function; instead we used the pipe operator, which is a very powerful operator that is heavily used with the dplyr package. Well, now let's have a look at this dataset, table3. This data is not tidy; as you see, in the rate variable there are two variables joined together, cases and population, and the forward slash separates these two variables, or these two values. The rate column contains two variables.
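The spread step described above, with and without the pipe (spread() is the older tidyr function for this; newer tidyr code would use pivot_wider()):

```r
library(tidyverse)

# Give cases and population their own columns
spread(table2, key = type, value = count)

# Exactly the same operation written with the pipe operator
table2 %>% spread(key = type, value = count)
```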
So to separate these variables, we use the separate function; in that way, we can make this dataset tidy. Let's write table3, which is the dataset, and then we use the pipe operator. I think now you understand how to use the pipe operator with a function: right after the name of the dataset, we write the pipe operator, and then we write the function separate, which is the function that we want to use to separate these two variables. Inside the function, we write the variable that we want to change, or separate, which in this case is rate. Then, in the argument into, we write the variables that we want to create, which in this case are the two variables that we want to separate it into, and they are cases and population. Cases and population are separated by the forward slash that you see here. Well — unexpectedly, yep, I forgot to write the c function. The c function actually concatenates the different values, or the different variable names, which in this case are cases and population. And then we write the separator value, the forward slash. Now the data is tidy: cases has its own column and population has its own column. 15. Welcome to Data Wrangling: Hi, I'm glad that you chose this course. Welcome to data wrangling. This course focuses on preparing you for selecting and cleaning data for downstream analysis. Well, now that you have some experience with R, we ought to be able to jump right into the practical aspect of data cleaning. One of the major components of a data scientist's job is to collect and clean data. The first step in any data science project is getting, cleaning, and understanding the data. Data wrangling takes almost 80% of the time, and the remaining 20 percent is for modelling and exploration. Once you have mastered it, you will be ready to manipulate any dataset that you want to analyze. You will also learn about techniques for exploring and summarizing data. We will start with some motivating examples.
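Before moving on, the separate call from the previous lesson can be sketched as:

```r
library(tidyverse)

# Split the rate column of table3 into cases and population at the "/"
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")
```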
So you can see the bigger picture, and then dive into the details. We will be focusing on the essentials, so you can gain as quickly as possible the essential skills that are needed to become a data scientist. 16. Installing Packages: There are some prerequisite packages that you need to install. You need to install the tidyverse package. You will also need to install the main dataset for this class. To install a new package, write install.packages, and then within the parentheses write double quotation marks and the name of the package that you want to install. In this case, we want to install tidyverse. Tidyverse contains many different packages that we'll be using during this course. Press Enter and install this one. I don't need to install it again, but you need to install this one. And the second package that we want as well is called nycflights13, which stands for New York City flights. Hit Enter. Make sure that you have an internet connection. Well, once you install these two packages, you need to load them. For that, use library; library takes the name of a package and loads it into the environment. So this package is called dplyr; dplyr can be found inside the tidyverse. As I said, tidyverse contains many different packages. And this is the second package; it's called nycflights13. Press Enter. Well, now we have loaded the two packages, and they are ready to be used. The dataset that we want to use in this class is called flights. You can read more about this data by writing the help function and then the name of the dataset; this dataset is called flights. Well, in the next tutorial, we will be focusing on subsetting this dataset using dplyr. Before that, I want you to read a little bit more about this dataset, so it will be easy for you to manipulate the different variables inside this dataset.
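The install-and-load steps from this lesson, collected in one place:

```r
# Install the prerequisite packages (only needed once per machine;
# uncomment these two lines on your first run)
# install.packages("tidyverse")
# install.packages("nycflights13")

# Load them into the current session
library(tidyverse)
library(nycflights13)

# Documentation for the main dataset used in this class
help(flights)
```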
So this dataset is on-time data for all flights that departed from New York City to different cities in the United States. To view the data, first of all we save this dataset in a variable; let's say mydata. You can give it any name that you want. Let's say mydata; then write the name of the dataset, which in this case is flights, and use the View function. head is another useful function when we want to select the first n rows; write the name of the dataset, in this case mydata, because we saved it in mydata, then write n. This n tells how many rows we want to see. You can select any number — say a hundred, or two hundred, or five for the first five rows, or the first 10 rows, as you want. Let's say in this case a hundred. And as you see, the View function returns a table of the data. This dataset contains different variables you can read, such as month, departure time, scheduled departure time, departure delay; these values are present in the data. You can read a little bit more about the descriptions with the help function. You can also see how many rows and columns are in this data using dim, and then writing the name of the dataset, in this case mydata. The dim function returns the dimensions of the data: the first number you see here is the number of rows, and this one is the number of columns. So that is what the dim function is used for; it lets you understand the dimensions of the data. You can also use ncol to get the number of columns in the dataset. As you see, there are 19 columns in this dataset. You can also get the number of rows using nrow, then writing the name of the dataset; this number is the number of rows. I hope that you understood the different ways to see the dimensions of data using dim, ncol, and nrow, as well as how to visualize tabular data using the View function.
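The inspection functions from this lesson, in one sketch:

```r
library(nycflights13)

mydata <- flights    # save the dataset in a variable

head(mydata, n = 5)  # first five rows (View(mydata) opens the full table)
dim(mydata)          # rows then columns: 336776 19
ncol(mydata)         # 19
nrow(mydata)         # 336776
```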
In the coming tutorials, we'll be focusing on subsetting and transforming the dataset using the package called dplyr, the package that you already installed. This package is inside the tidyverse package that you installed. 17. Data Subsetting filter function: Hi, this lecture is about subsetting data. Subsetting is a very useful indexing feature for accessing parts of the data. In this lecture, we're going to use dplyr, which is a grammar of data manipulation that enables data frame manipulation in a user-friendly way. These are the five key dplyr functions, and they are used for the vast majority of data manipulation: filter picks observations based on their values; arrange is used to change the ordering of rows; mutate adds new variables to the dataset; select picks variables based on their names; and summarize reduces values down to a single summary. Well, now let's have some practice and understand more about how data subsetting is used. Well, let's do some subsetting. As I said, subsetting data is a very useful indexing feature for accessing parts of the data. So we're going to use dplyr functions such as filter and arrange. We start, first of all, by using the filter function to select rows. We also start by loading the two packages and saving the data in a variable. If you already loaded the packages and saved the flights data in a variable, that's fine. Well, now let's see what this dataset contains using names, and then write the name of the dataset. In this case, I saved this dataset in a variable called mydata, but you can give it any name that you want. Well, these are the variables inside this data. This data contains 19 variables, and the data is on-time data for all flights that departed from New York City in the year 2013. We can look at the structure of the data using the str function and then writing the name of the data. This function returns the structure of the data.
This dataset contains different classes, or different types. As you see in this case, int stands for integer; an integer is any number that does not contain a decimal value. This is a numeric class, or numeric object; a numeric object may contain a decimal value. This is a character class — the carrier, the flight. So using the str function you can understand the different types, or the different objects, that this dataset contains. Well, now we can use the summary function. Using the summary function, you can gain more information about the data. For example, year is a variable: the minimum value in this variable is 2013, and the maximum value in this variable is 2013. And obviously the year is 2013 for all of them, because the data is on-time data for all flights that departed from New York City in 2013. You can as well see which variables have missing values. For instance, this variable contains 8,255 missing values — NA stands for missing value — so this variable contains 8,255 missing values. And this one as well, the arrival time variable, contains 8,713 missing values. This one, the arrival delay, contains 9,430 NAs, which stand for missing values. Well, now let's do some subsetting. We subset and try to select different rows in this dataset. The following code can be used to access all the flights that departed in the months of February and May using the filter function. The filter function is suited to subsetting observations based on their values. Within the filter function, we write the name of the dataset and then the name of the variable of interest, which in this case is month. We want to select February and May. Well, this c function concatenates together these two values: two represents February, and five represents May. Run this code. As you see, there are 53,747 flights. Suppose now we want to add another month, let's say September.
We can write it inside the c function; the c function links together the different values. So nine is the month of September. Run this code. Well, 81,321 flights. Let's now select all the flights that departed in the month of January: 27,004 flights departed in the month of January. We can also select all the flights in the month of January using base R code. This is base R code, and this one uses the filter function; the filter function is inside the package dplyr, and dplyr is much more flexible and widely used for data subsetting. You see, the result is the same, 27,004 flights in that month; the difference is just the way we write the code. Well, suppose now we want to select all the flights that departed on the second of January. Well, in this case we have two variables, month and day, and we can write both variables inside the filter function; we just write a comma to separate the two variables. In this case, month should equal one, and day two. Remember to write two equals signs; if you write one equals sign, you'll get an error, so make sure to write two equals signs. Make sure as well to write the variables exactly as you find them in the dataset; if you do not write the variables correctly, you will also get an error. Well, 943 flights. Now let's add another variable to the filter function. Suppose we want to select all the flights that departed on the second of January to the destination of Miami. Let's find Miami in the dataset. This variable contains all the airport codes; we want to select Miami International Airport. You see, MIA stands for Miami International Airport. Well, let's find the flights that are going to this destination. So now we have three variables. Make sure to always write the dataset first; the name of the dataset is the first argument that filter takes. Then we write the variables; in this case, we have three variables: month should equal one, day two, and then the third variable, dest.
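The filter calls walked through so far (the counts in the comments are the ones quoted in the lesson):

```r
library(dplyr)
library(nycflights13)
mydata <- flights

filter(mydata, month %in% c(2, 5))      # February and May: 53,747 flights
filter(mydata, month %in% c(2, 5, 9))   # add September: 81,321 flights
filter(mydata, month == 1)              # January: 27,004 flights
mydata[mydata$month == 1, ]             # the same January subset in base R
filter(mydata, month == 1, day == 2)    # 2nd of January: 943 flights
```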
If you forget the double quotation marks, the program will return an error; this is exactly how we write a character, or string. Run this code. Well, you see, 31 flights on the second of January to the destination of Miami International Airport. Suppose now we want to add another airport; let's go back to the dataset, to the destination variable. Let's say we want to add this airport, ATL, which stands for Atlanta. Now the destination contains two character values; in this case, we write the c function to link together the different characters. Make sure that the c function is lowercase. Run this code. As you see, there are 38 flights on the second of January to the destination of these two airports. 18. Data Subsetting arrange function: Well, now let's see how the arrange function is used. The arrange function works similarly to filter, except that instead of filtering rows, it reorders them. The arrange function takes as well the name of the dataset, so mydata, and then we write the variable that we want to reorder by; in this case, dep_time, the departure time. Run this code. Well, now the order of this variable is changed: as you see, it starts from the first value, which is one. If you look at it before the arrange function, the rows do not have any order, but in this case the order is changed; it is sorted from the first value to the last value, or the last number. In this case, one is the first value. We can also use the desc function to order a column in descending order. Let's see that now. In this case, we use the desc function to sort a variable in descending order, and this is the name of the variable that we want to sort, or reorder. As you see, now the order is changed to descending; we sorted this variable in descending order, from the last value to the first value. One last thing that I want to mention is that this dataset is not a data frame.
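The destination filter and the two arrange calls from these lessons:

```r
library(dplyr)
library(nycflights13)
mydata <- flights

# Flights on the 2nd of January to Miami or Atlanta
filter(mydata, month == 1, day == 2, dest %in% c("MIA", "ATL"))

# Reorder rows by departure time, ascending then descending
arrange(mydata, dep_time)
arrange(mydata, desc(dep_time))
```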
It's called a tibble. Tibbles have a refined print method that only shows the first 10 rows. So far, any time we run the code, we have only been getting the first 10 rows, and as well, the output has a little description of the type of each variable. As you see, under each variable there is a short description: month is an integer, day is an integer. 19. Dplyr select & rename Functions: Hi, this lecture is about dplyr functions used to subset data. In the previous tutorial, we learned different kinds of data subsetting, such as using the filter function and the arrange function. In this tutorial, we will learn some more dplyr functions, such as the select function, and the rename function to rename a variable. First of all, you need to load the packages and save the data in a variable. Well, to rename a variable, use the rename function. Let's see, first of all, which variable we want to rename, using the names function, and then write mydata, which is the variable I saved the dataset in. Suppose we want to change the variable dest to destination. So, to change a variable, or a column, use the rename function from the dplyr package. Write the name of the dataset, then write the new variable: destination is the new name that we want to give to dest, this one — this one is the variable that we want to change. You can save this code back into the dataset, so you can see whether the variable is changed or not using the names function; write the data and run this code. Well, you see destination. Now we have changed the variable; before, it was dest. Suppose now we want to select some variables from the dataset. Let's say we want to select year, month, and day. So we use the select function, and within the select function we write the name of the dataset, and then we write the variables that we want to select, which in this case are year, month, and day. Run this code. You see, there are three variables; we just selected these three variables from the dataset.
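The rename and select calls from this lesson, in one sketch:

```r
library(dplyr)
library(nycflights13)
mydata <- flights

# rename takes new_name = old_name
mydata <- rename(mydata, destination = dest)
names(mydata)                     # dest is now destination

# Pick just three variables by name
select(mydata, year, month, day)
```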
So select is very useful and can be used to subset, or select, different variables, and it is especially useful when the dataset contains hundreds or even thousands of variables. Suppose now we want to select all the variables between year and flight. In this case, we write the two variables, year and flight, joined by a colon. This code will return all the variables between year and flight. Run this code: as you see, there are 11 variables, and these are the variables between year and flight. Now we want to select all the columns, or all the variables, except those in the range from air_time to distance. In this case, we write the subtraction sign, or the minus sign. When I run this code, you see there are only 17 variables; we selected all the columns, or all the variables, except those variables. And if you want to select all the variables except one variable — let's say in this case we want to select all the variables except year — in this case you just write the minus sign and then the variable that we want to leave out, which in this case is year. Well, you see, now there are only 18 variables. The variable year is missing, because this code returns all the variables except the variable year. 20. Dplyr summarize group by mutate: Hi, in this lecture we're going to talk about the summarize and group_by functions. Together, the summarize and group_by functions form a powerful tool that changes the unit of analysis from the complete dataset to individual groups. To understand how the group_by function and summarize work, let's see the following example. Suppose we want to group the data by some variables and then operate on the data by group. Let's say in this case we want to count all the carriers' flights in a single day. Well, let's save the group_by call in a variable and see how we can summarize the data. The first argument that this function takes is the dataset; in this case, mydata is the name of the variable I saved the flights data in.
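Before continuing with group_by, the range and exclusion forms of select from the previous lesson can be sketched as (the column counts in the comments are for the 19-column flights data):

```r
library(dplyr)
library(nycflights13)
mydata <- flights

select(mydata, year:flight)           # all columns from year to flight: 11
select(mydata, -(air_time:distance))  # everything except that range: 17
select(mydata, -year)                 # everything except year: 18
```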
Then we write the variables that we want to group by. So the group_by function groups the data by some variables, and then we operate on the data by group. In this case, there are four variables: year, month, day, and carrier. Carrier is the airline. Well, now we can combine this with the summarize function, and then we count how many flights each carrier has in a single day, because we already grouped the data by these variables: year, month, day, and carrier. Well, this function takes two arguments. The first argument is the grouped data, which is this variable; we saved the group_by result in this variable. The second argument creates count with a counting function, which counts the rows of the grouped data; we will see that in a second. Run this code. Well, this is grouped data, because we used the group_by function, which takes the data and converts the dataset into grouped data. Well, these are the four variables that we included in the group_by function, and group_by groups the data by year, month, day, and carrier. Well, count comes from the summarize function, because summarize creates a new data frame which has one row for each combination of the grouping variables — these are our grouping variables — and count actually counts the number of flights for each carrier on different days or months. For example, for the first carrier there are 28 flights on month one, day one, and you see the other carriers as well. These are the different carriers, which are the airline codes. Well, now let's add another variable to the group_by function, let's say in this case destination. So now, using destination, we can count the different destinations on different days. Well, this is the destination variable, and now the group_by function groups the data by these variables: year, month, day, carrier, destination. And count comes from summarize, because summarize adds another column to compute, or count, the number of flights per destination.
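A sketch of the grouped counts described above (the summarize column name `count` and the variable names `by_day` and `by_dest` are assumptions; `n()` is dplyr's per-group row counter):

```r
library(dplyr)
library(nycflights13)
mydata <- flights

# Group by day and carrier, then count the flights in each group
by_day <- group_by(mydata, year, month, day, carrier)
summarize(by_day, count = n())

# Add destination to the grouping to count per-destination flights
by_dest <- group_by(mydata, year, month, day, carrier, dest)
summarize(by_dest, count = n())
```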
On the 1st of January 2013, the carrier shown here has only one flight to this destination airport. Well, now let's add a new variable to the dataset using the dplyr package. Let's look at the dataset. This variable, distance, is in miles; we create a new variable and convert it into kilometers, so we add a new variable. Well, we use mutate. Mutate is used to add a new variable to a dataset. The first argument is the dataset, which in this case is mydata. Then we write the new variable that we want to create, let's say distance2. We want to convert the miles into kilometers. One mile is roughly equal to 1.60934 kilometers, so to convert the variable, we write the distance variable times 1.60934. Run this code. Well, the new variable is created. Now there are two distance variables in this dataset. We can access the new variable; first of all, save this result back into the dataset, because we added a new variable, so we need to save it so the new variable is added to the dataset, and then use the select function to access the new variable. Let's try it: the previous variable and the new variable. So this one is distance in miles, and this one is distance in kilometers. Highlight all of them and then run. You see, these are the two variables: distance in miles, and distance in kilometers. We converted the first distance into kilometers. 21. Statistical Learning: Hi, this lesson is about statistical learning. Statistical learning refers to a vast set of tools for understanding data. Well, let's take the following example so that you understand what statistical learning is. This dataset is about an income survey for males in the central Atlantic region of the USA. In this dataset, the output variable, or the variable that we want to predict, is wage. In the machine learning field, the output variable is the variable that we want to predict, also called the target variable.
The remaining variables that you see here, such as age, race, education — these are the predictor variables, and they are used to build the model, or to predict the wage. So in the machine learning field, the output variable is denoted by y, and the input variable, or the predictor variable, is denoted by x. In this case, there are different predictors that are used to predict the output variable, so the form will be X1, X2, X3, X4, up to Xp. You need to be familiar with the different notations that are used in the field of machine learning, or statistical learning. As I said, y is the output variable, or the variable that we want to predict. The output variable is also called the response variable, target variable, or dependent variable. And the variable that is used for the prediction, or to build the model, is called the independent variable, also known as the predictor variable or input variable. You need to be familiar with the different names that each variable has: y is the output variable, also called the response variable, target variable, or dependent variable — this is the variable that we want to predict — and x is the independent variable, also known as the predictor variable. Here is the relationship between the output and input variables. This relationship can be written in the following form. So this formula is the relationship between the output and the input: y is the output variable, and x is the input variable, or the predictor variable. So y is a function of x plus some error term. This is the error term. f here is an unknown, fixed function of x, which is the input or predictor variable. In this case, f represents the systematic information that x, the input variable, provides about y, which is the output variable. The error term tells how good the prediction is. In the real world, the independent variables are never perfect predictors of the dependent variables.
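The relationship described here, with predictors X1 through Xp, is usually written in the standard statistical-learning notation as:

```latex
Y = f(X) + \epsilon, \qquad X = (X_1, X_2, \ldots, X_p)
```

where $f$ is the unknown fixed function and $\epsilon$ is the error term, independent of $X$.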
Even if we use the predictor variables to predict the response variable, the prediction will never be perfect. So this error term tells how certain we can be about the prediction of the data. This function is an unknown function, but we can estimate it using different statistical tools. Statistical tools can be classified as supervised learning or unsupervised learning. Supervised learning involves building a statistical model for predicting, or estimating, an output — which is the target variable, or the variable that we want to predict — based on one or more inputs. Inputs are the predictors, or the variables that we use to predict the output variable. In contrast, in unsupervised learning there are inputs but no supervising output. In this case, the algorithm tries to make sense of the data by extracting features and patterns on its own. So statistical learning refers to a set of approaches for estimating this unknown function. Remember, this function represents the systematic information that the predictor variables provide about the output variable. In other words, this function connects x and y, the output variable. So basically this function connects the input variable and the output variable, and it's an unknown function, but using different statistical tools we can estimate the value of this function. In the coming tutorials, we will be focusing on some of the key theoretical concepts that arise in estimating this function, as well as the tools for evaluating the estimates obtained. 22. Why Estimate F: Hi. In the previous lesson, you learned that statistical learning refers to a set of approaches for estimating the unknown function — the function that represents the systematic information that the independent variables, or predictor variables, provide about the output variable, or the response. In this lesson, I'm going to talk about why we need to estimate this function.
There are two main reasons that we may wish to estimate this function: prediction and inference. Let's start with prediction. In the previous tutorial, you learned that the relationship between the output variable and the input variable can be written as Y = f(X) + ε: Y is a function of X plus some error. We mentioned that Y is the output variable. This function f is unknown, but we can estimate it using different statistical tools. Generally speaking, we estimate this function when we want to predict an output variable. For instance, let's say we want to predict the wage of a person that is not in this dataset. This dataset is a wage and income survey for males in the central Atlantic region of the US. Suppose we want to predict the wage of a person who does not appear in this dataset, but we know that person's age, race, education, region, job class, health, and so on; wage is the output variable that we want to predict. In this case, we can build a model with the observed variables, the ones you see here such as year, age, and race: these are the predictor variables, and they can be used to predict the wage variable, which is the output variable. This is the typical prediction scenario, where we predict the output variable based on the predictor variables. In addition, in many situations a set of inputs is available but the output cannot be easily obtained. So to predict an output variable we first need to estimate this unknown function. We can then predict the output variable using the following formula: ŷ = f̂(X), where ŷ (y hat) is the resulting prediction for the output Y, and f̂ (f hat) represents the estimate for f.
Generally speaking, ŷ = f̂(X) is the prediction formula, while Y = f(X) + ε is the actual relationship between the output and input variables. ŷ is the resulting prediction for the output variable Y, and f̂ represents the estimate for f, the actual function that carries the systematic information the predictors provide about Y. The accuracy of ŷ as a prediction for Y depends on two quantities: the reducible error and the irreducible error. The reducible error is an error that we can reduce, so that we get a better estimate of the unknown function, by using the most appropriate statistical learning techniques to estimate it. f̂ is our estimate of the real function, which represents the systematic information that X provides about Y. Even after we estimate f, the estimate will never be perfect, meaning we never recover f exactly; for that reason, there will be some error in it. That error is called the reducible error, an error that we can reduce using more appropriate statistical tools, and the accuracy of ŷ depends on it. However, even if it were possible to form a perfect estimate of f, the real function, our prediction would still have some error in it, because the output variable Y is also a function of the error term ε. The error term is also called the irreducible error, and it too affects the accuracy of our predictions. Remember that in real life the independent variables are never perfect predictors of the dependent variable.
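A small simulation (my own illustration, not the course's) can separate the two error sources: fit a least-squares line, then compare its in-sample mean squared error with the error you would still make using the true f:

```r
# Reducible vs. irreducible error on simulated data with a known truth.
set.seed(1)
x   <- runif(500, 0, 10)
eps <- rnorm(500, sd = 2)              # irreducible error, variance 4
y   <- 2 + 3 * x + eps                 # true f(x) = 2 + 3x
fit <- lm(y ~ x)                       # f-hat: our estimate of f
mse_fit  <- mean((y - fitted(fit))^2)  # error using the estimate f-hat
mse_true <- mean((y - (2 + 3 * x))^2)  # error left even with the true f
# mse_true stays near var(eps) = 4 no matter how good the estimate is:
c(fitted = mse_fit, irreducible = mse_true)
```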
So the error term tells us how certain we can be about our predictions. The error term results in an irreducible error because no matter how well we estimate the real function, we cannot reduce the error introduced by the error term. The second reason that we estimate f is inference. In the inference case, we are interested in knowing how the predictor variables affect the output variable. In this situation, we still wish to estimate f, but our goal is not necessarily to make predictions for the output variable. Instead, we want to understand the relationship between the output and input variables, in other words, the relationship between X and Y, or more specifically, to understand how the output variable changes as a function of the input variables. In this setting, there are a few important points that we need to consider. We have to identify which predictors are associated with the response variable, because it is often the case that only a small fraction of the available predictors are substantially associated with the output variable Y. So we need to identify which predictors contribute the most to the response variable. The second thing that we need to consider is understanding the relationship between the response and each predictor: some predictors may have a positive relationship with the output variable, others a negative one. To understand the difference between inference and prediction, let's say that a company is interested in knowing whether an email it received from a client is spam or not, based on the features of the email. This is an example of prediction: the features of the email serve as the predictors, and whether the email is spam or not serves as the outcome variable. In contrast, here is an inference problem.
In this case, a pharmaceutical company wants to evaluate whether a patient's risk of a severe adverse reaction to a particular drug is due to a higher level of cholesterol. This situation falls into the inference paradigm, because the company is interested in obtaining a deep understanding of the relationship between the outcome of the drug and the higher level of cholesterol. So generally speaking, inferential statistics is about understanding the relationship between the individual predictors and the outcome variable, or response variable. That is the difference between prediction and inference. I hope that you now understand why we need to estimate the real function, the one that connects the input variable and the output variable. 23. Installing Packages for LinearRegression: Hi, welcome to the linear regression model. First of all, you need to install these two packages. The first package is called datarium; the data that we are going to use in this tutorial can be found in this package, so you need to install it. The second package that you need to install is called corrplot. This package is used for visualization: we are going to visualize the relationships between different variables. Once you install these two packages, you will also need to load them, so let's run those lines. 24. Simple LinearRegression: Hi. In the previous tutorial, we briefly discussed statistical tools. In this tutorial, you will learn linear regression, which is one of the statistical tools used for estimating the unknown function that we talked about in the previous tutorial. Linear regression is a very simple approach to supervised learning. In particular, linear regression is a useful tool for predicting a quantitative response. A quantitative variable is a variable which contains numeric values, such as age and salary.
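The package setup described in lesson 23 might look like the following sketch in R (the install lines are one-time and are usually commented out after the first run):

```r
# One-time installation of the two packages used in this section:
# install.packages("datarium")  # provides the marketing dataset
# install.packages("corrplot")  # correlation-plot visualization
library(datarium)               # load the data package
library(corrplot)               # load the plotting package
```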
Well, there are many statistical learning approaches, but linear regression is a very important statistical tool. In this lesson, we review some of the key ideas underlying the linear regression model. Let's start with simple linear regression. Simple linear regression is a very straightforward approach for predicting a quantitative response on the basis of a single predictor variable. For instance, this scatter plot visualizes the relationship between sales and youtube. Simple linear regression assumes that there is approximately a linear relationship between X, the predictor variable, and Y, the target or dependent variable. In this case, sales represents Y, because we want to predict sales based on the youtube variable, which is X, the predictor variable, also known as the independent variable. Mathematically, we can write the linear relationship between X and Y as Y ≈ β0 + β1·X. In this equation, Y represents the output variable (in this case, sales), and β0 and β1 are two unknown constants that represent the intercept and slope terms in the linear model. So β0 is the intercept and β1 is the slope term, and together they are the model coefficients, also known as parameters. In this setting, X represents the youtube ads, the money spent on YouTube advertisements in order to promote sales, and Y is the sales. We can write the linear relationship between sales and youtube using this formula: sales ≈ β0 + β1·youtube. So we regress sales onto youtube ads using this formula. The line that you see here is the linear regression line; it is the actual model.
And this model is built using that equation. The dots that you see here, the black dots and the red dots, are the observed data. The distance between an observed data point and the regression line is called the residual. So these vertical distances between the observed data and the regression line are the residuals: the vertical deviations from the observed data to the regression line. This blue line is the regression line. Mathematically, we can write the residual as the difference between the observed value and the predicted value: eᵢ = yᵢ − ŷᵢ, where ŷ lies on the linear regression line, the blue line that you see here. The most common method for fitting a regression line is ordinary least squares. Ordinary least squares calculates the best-fitting line for the observed data by minimizing the sum of the squared vertical deviations from each data point to the line. We use the ordinary least squares method to obtain the best-fitting regression line, and that line must minimize the sum of squares of all these residuals. The sum of the squared vertical deviations from the observed data points to the regression line is known as the residual sum of squares, RSS. Mathematically, this can be written as RSS = Σ(yᵢ − ŷᵢ)², the sum of the squared differences between the observed data and the predictions. Any method which minimizes this metric can be used to obtain the best regression line; the best regression line is the one that minimizes it. So the ordinary least squares method minimizes the residual sum of squares, the squared vertical deviations from the observed data to the regression line. Our goal is to obtain the best-fitting regression line.
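The RSS computation can be checked by hand in R. This sketch uses simulated stand-in data (not the marketing dataset) and confirms that the coefficients returned by lm() reproduce the residual sum of squares:

```r
# Compute RSS by hand and compare it with lm()'s least-squares fit.
set.seed(42)
youtube <- runif(200, 0, 300)                      # simulated ad budget
sales   <- 8 + 0.05 * youtube + rnorm(200, sd = 3) # simulated response
fit <- lm(sales ~ youtube)                         # ordinary least squares
b   <- coef(fit)                                   # beta0-hat and beta1-hat
rss_hand <- sum((sales - (b[1] + b[2] * youtube))^2)
rss_lm   <- sum(residuals(fit)^2)
isTRUE(all.equal(rss_hand, rss_lm))                # the two agree
```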
So we can use the ordinary least squares method to obtain the best-fitting line for the observed data; that method minimizes the sum of the squared vertical deviations from the observed data to the regression line. As you see, these vertical lines are the residuals. Well, we don't have to calculate this by hand: the R program computes the parameters β0 and β1 for us, and with them we can find the best regression line. In the coming tutorials we will do some practice to understand exactly what is meant by the best regression line. 25. Hypothesis Testing: Hi. In the previous tutorial, we talked about simple linear regression. In this tutorial we're going to talk about hypothesis testing. Before we dive into a practical approach, let's evaluate the relationship between X and Y. Hypothesis testing is a statistical method used to make statistical decisions from data; in other words, statistical testing is a process of drawing conclusions on the basis of statistical tests of the data. The purpose of hypothesis testing is to determine whether there is enough statistical evidence in favor of a certain hypothesis. For instance, to evaluate the relationship between Y and X, we use the null hypothesis. Let's say in this case that Y represents sales, since it is the output variable, the target variable that we want to predict on the basis of youtube ads; youtube in this case represents X, the predictor variable used to build the model. We want to evaluate the strength of the relationship between these two variables, sales and youtube. There are two hypotheses. The first is the null hypothesis, which states that there is no relationship at all between the two variables, between X and Y.
So the first hypothesis that we want to test checks whether there is no relationship between X and Y. The alternative hypothesis says that there is indeed some relationship between X and Y. Mathematically, the linear relationship between X and Y, between the predictor variable and the target variable, can be written as Y ≈ β0 + β1·X. We said that these parameters, β0 and β1, are two unknown constants. In hypothesis testing, the null hypothesis assumes that there is no relationship between X and Y; in that case the slope parameter is zero, so under the null hypothesis β1 = 0, because when the slope is zero, X carries no information about Y and there is no relationship at all between them. In the alternative hypothesis, if there is some relationship between X and Y, the slope parameter will be different from zero: β1 ≠ 0. So when the alternative hypothesis is true, it indicates that there is indeed some relationship between X and Y. Now let's have a look at the scatter plot that you see here. There is some relationship between sales and youtube, as you can see from this regression line: the trend is a linear trend, and the black and red dots represent the observed data. So in this case we see a linear trend.
There is obviously a relationship between sales and youtube; we could also say that there is a positive correlation between these two variables. Now the question is: how can we know whether the null hypothesis is true? How can we know whether the slope β1 is equal to zero? How can we reject the null hypothesis, or accept that there is no relationship between these variables, and how can we support the alternative hypothesis to make sure that there is indeed a relationship between these two variables? In that case, we use a t-test. Any time we want to conduct this kind of hypothesis test, we usually use the t-test. The t-test measures how many standard errors the coefficient is away from zero. The greater the magnitude of the t-statistic, the greater the evidence against the null hypothesis, which means that if we get a large t-statistic, we can reject the null hypothesis and support the alternative hypothesis. In that case, we can be confident that there is indeed a relationship between sales and youtube, the two variables for which we are conducting the hypothesis test. You also have to know that every t-statistic has a p-value to go with it. The p-value is the probability that the result from the sampled data occurred by chance, so a larger t-statistic is associated with a lower p-value. In order to evaluate the p-value, there is a threshold, and that threshold is called alpha. Alpha is used to determine whether we reject or retain the null hypothesis, and likewise whether we accept or reject the alternative hypothesis. The value of alpha is 0.05, or 5%. We said that any t-statistic has a p-value to go with it.
And the p-value is the probability that the results from the sample data occurred by chance, so a larger t-statistic is associated with a lower p-value. If the p-value is less than or equal to alpha (we said that alpha is 0.05), then we reject the null hypothesis and go with the alternative hypothesis, which says that there is indeed a relationship between X and Y. In contrast, if the p-value is greater than alpha, we reject the alternative hypothesis and go with the null hypothesis; in that case we say the result is not statistically significant, meaning we retain the null hypothesis and reject the alternative hypothesis. When the p-value is less than alpha, we say it is statistically significant, which means that the estimated slope being different from zero did not happen by chance: the parameter really is different from zero, which suggests that there is a relationship between the two variables, such as sales and youtube in this case. We can then conclude that there is a linear relationship between sales and youtube: we find a large t-statistic and a low p-value, so we reject the null hypothesis and go with the alternative hypothesis, which says that there is indeed a relationship between sales and youtube. Well, now let's have a look at this output. I have run the linear model in the R program, regressing sales on youtube, and when I call summary on the model I get this output. This is the summary of the model, and the p-values that we discussed can be found here; there are two of them.
The first p-value goes with the first parameter, the intercept: we said that the intercept is β0, and you can see its estimate here. The second value that you see here goes with the second parameter, β1. There is a relationship between sales and youtube, because the p-values here are less than the threshold that we set, alpha, 0.05 or 5%. Here is the t-value as well; we said that every t-value has a p-value to go with it, and the higher the t-value, the lower the p-value. In this case, the p-value is less than the threshold of 0.05, or 5%, so we can reject the null hypothesis. Remember that the null hypothesis assumes that there is no relationship at all between the two variables, between X and Y. Well, in this case the p-value is well below the 5% threshold, so we can reject the null hypothesis. In the next tutorial we're going to talk about some other metrics, so we can further confirm the relationship; we have already seen that there is a linear relationship between sales and youtube. After that, we dive into the practice and write some code in the R program, so we can compute the different values that we have talked about. 26. Evaluating Metrics for Linear Model: Hi. In this tutorial, you will learn how to evaluate the metrics for a linear regression model. We're going to talk about the different metrics used to evaluate the strength of the model. Evaluation metrics measure how well a model performs and how well it approximates the relationship between the dependent variable and the independent variable. Well, one of the metrics to evaluate is the residual standard error. The residual standard error is the average variation of the observation points around the fitted regression line, also known as the model sigma.
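The coefficient table from the hypothesis-testing lesson can be pulled out programmatically. Here is a sketch on simulated stand-in data (the course fits on the marketing dataset instead):

```r
# Read the t-statistics and p-values out of summary(model).
set.seed(7)
youtube <- runif(200, 0, 300)
sales   <- 8 + 0.05 * youtube + rnorm(200, sd = 3)
fit <- lm(sales ~ youtube)
tab <- summary(fit)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
tab["youtube", "t value"]          # t-statistic for the slope beta1
p <- tab["youtube", "Pr(>|t|)"]    # its p-value
p < 0.05                           # TRUE -> reject H0: beta1 = 0 at alpha = 0.05
```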
Well, the blue line that you see here is the regression line, and the residual standard error is the average variation of the observed points around the fitted regression line. Roughly speaking, the residual standard error is the average amount that the response will deviate from the true regression line, and it is computed using this formula. It is a way of assessing how far these points are from the regression line. The lower the value of the residual standard error, the better the model fits the data. On the other hand, if the distances between the regression line and the observed data are large, we will get a higher value of the residual standard error; the residuals may be quite large, indicating that the model doesn't fit the data well. So the point here is that the lower the value of the residual standard error, the better the model fits the data; in contrast, a higher value of this metric indicates that the model doesn't fit the data very well. The residual standard error is computed using this formula. You don't have to calculate it by hand; the R program will compute it for us easily, and you can read off the residual standard error. In this formula, the sum of squares that you see is the squared difference between the output variable and the prediction ŷ. And here you can see the residual standard error in our case: when we regress sales on youtube, the residual standard error we get is 2.91. The residual standard error indicates how far the observed data are, on average, from the regression line, and we said that the lower the value, the better the model fits the data. So the residual standard error in this case is 2.91, and that's not bad.
We are assuming that there is a linear relationship between sales and youtube. Youtube here is the amount of money spent on YouTube advertisements in order to promote sales. In this case, we assume a linear relationship between sales and youtube, which means that as we increase the amount of money spent on YouTube advertisements, we make more sales. This is how we interpret the linear relationship between the two variables: as advertisement on the YouTube platform increases, sales increase as well. That is the linearity in the relationship between these two variables. Well, the second metric to evaluate is called R-squared. R-squared provides an alternative measure of fit: it indicates how close the observed data are to the fitted regression line. The value of this metric ranges from 0 to 100 percent; a higher R-squared indicates a stronger relationship between Y and X, and a value closer to 0 suggests a weak relationship between the variables. The R-squared for this model can be found here, labeled "Multiple R-squared". We said that the value of R-squared is between 0 and 100 percent; in this case we have 61%. The most common interpretation of R-squared is how well the regression model fits the observed data. For instance, the value of 61% reveals that roughly 61% of the variability in the data is explained by the regression model. Generally speaking, a higher R-squared value indicates a better fit for the model; in contrast, if the value of this metric is closer to 0, the model doesn't fit the data very well. Well, the problem with R-squared is that its value increases with the addition of every additional independent variable, which can mislead the results. Well, in this case we have only two variables.
We have the output variable, which is sales, and youtube ads, which is the predictor variable. If we add another variable, such as facebook ads, and regress sales onto the two variables youtube and facebook, then we will get a higher value of R-squared. So any time we add a new variable to the model, this can mislead the results, because R-squared increases even if that variable is not good for our model. To overcome this issue, the adjusted R-squared adjusts for the number of parameters in the model: its value increases only when the addition of the new variable shows some improvement in the model fit, and otherwise it remains the same. So the adjusted R-squared statistic is preferred over the plain R-squared statistic. The adjusted R-squared value in this case is about 61%, as you see here. The greater the value, the better the model. We have said that any time we add a new variable to the model, the adjusted R-squared value increases only if that variable improves the model; in contrast, R-squared always increases when you add a new variable, regardless of whether it improves the model or not. That is the difference between these two metrics. Last but not least, the mean squared error is a very useful metric to evaluate. The mean squared error is the average of the squared errors: the mean, over all data points, of the squared difference between the predicted value and the observed value of the dependent variable. The error that we're talking about here is the difference between ŷ and the dependent variable Y; in this case sales represents Y, and youtube ads is the predictor variable that we use to build the model. The goal is to get a very low value of this metric. Well, I hope that you understood the different metrics to evaluate in order to select a good model.
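Each of these metrics can be read directly from a fitted model in R. Here is a sketch on simulated data (with datarium loaded, you would fit on marketing instead):

```r
# Extract the evaluation metrics from a fitted linear model.
set.seed(7)
youtube <- runif(200, 0, 300)
sales   <- 8 + 0.05 * youtube + rnorm(200, sd = 3)
fit <- lm(sales ~ youtube)
s <- summary(fit)
sigma(fit)               # residual standard error (model sigma)
s$r.squared              # R-squared
s$adj.r.squared          # adjusted R-squared (never above R-squared)
mean(residuals(fit)^2)   # mean squared error
```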
A good model is one which fits the data very well. We talked about R-squared, the adjusted R-squared, the mean squared error, and the residual standard error. These are the different metrics to evaluate, and they are very important: you need to evaluate these metrics before selecting a model. Well, the point here is to make sure that the model is predicting the data very well. 27. Correlation Plots: Well, the data that we want to use in this tutorial is called marketing, and it contains the impact of three advertising media (youtube, facebook, and newspaper) on sales. Sales is the target variable. The data give the advertising budget in thousands of dollars along with the sales. Let's see which variables are in this data using names(); the name of the data is marketing. As you see, these are the variables in the data, and sales is the target variable. Well, now let's look at the relationship between the target variable and the predictors. Remember, youtube, facebook, and newspaper are the predictor variables, and sales is the target variable that we want to predict using the model. The first thing to evaluate, and it is a very important thing, is the kind of relationship that exists between the target variable and the predictor variables. Well, let's find the correlation coefficient. The correlation coefficient measures the level of association between two variables, X and Y. The correlation between sales and youtube is 0.78. Remember that the value of the correlation ranges between minus one and plus one. A value of minus one indicates a perfect negative correlation, which is the case when Y decreases as X increases; so a negative correlation means that when X increases, Y decreases. In contrast, a correlation of plus one indicates a perfect positive correlation, which is when Y increases as X increases.
Well, in this case we have a positive number, not a negative one, so it indicates a positive correlation between these two variables. A value of 0.78 is good: it is closer to one, and a value closer to one indicates a positive correlation that can be nearly perfect, while a value closer to 0 suggests a weak relationship between the variables. Well, now let's see the correlation between sales and facebook. We get 0.57, which is not as good: compared to the first correlation, between sales and youtube, which is 0.78 and indicates a strong, nearly perfect relationship, the relationship between sales and facebook is weaker. The strength of the relationship between sales and youtube is much stronger than that between sales and facebook. Well, there is another way to see all the correlations between the variables: we use the corrplot package to visualize all the correlations between sales and the predictor variables. We use the corrplot() function from the corrplot package that you installed. The first argument that this function takes is the correlation matrix of the data, so inside corrplot() we write the cor() function, and within cor() we write the name of the data, in this case marketing. Here is the correlation plot; let's make it a little bigger. You can read off all the correlations between the variables. For instance, the correlation between sales and youtube is 0.78, the correlation between newspaper and youtube is 0.06, the correlation between sales and facebook is 0.58, and the correlation between sales and newspaper is 0.23. One of the reasons we use the correlation plot is to evaluate the kind of relationship that exists between variables, whether it is negative or positive. In this case, there is no negative value in the correlation plot.
All of them are positive, indicating a positive relationship, which means that when X increases, Y increases. So this is a positive relationship. 28. Multicollinearity: When we want to build a linear model, you need to check for multicollinearity. Multicollinearity is when the predictor variables are highly correlated with each other. So what does that mean? If there is a strong relationship among the predictor variables, which here are newspaper, facebook, and youtube, then there is multicollinearity in the data, which is not good for the model. If multicollinearity is present between the predictor variables, it can lead to misleading results, so we have to make sure that the predictor variables are not correlated with each other. This assumption can be easily tested using the correlation plot that you see here, which visualizes the kind of relationship that exists between the variables. For the independent variables, if any correlation approaches 0.8, there may be multicollinearity present. Well, in this case there is no multicollinearity in the data: as you see, there is no value that comes anywhere near 0.8. The correlation between newspaper and youtube is 0.06, the correlation between newspaper and facebook is 0.35, and the correlation between facebook and youtube is 0.05. This indicates that there is no multicollinearity in the data, which means there is no strong relationship among the predictor variables. Suppose the correlation between newspaper and youtube had been 0.8; in that case, there would be multicollinearity present in the data, and an effective solution to tackle this issue is to drop the collinear variables. 29.
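The correlation checks from the last two lessons can be sketched as follows. Since the marketing data may not be available here, the built-in mtcars data stands in; with datarium loaded you would write cor(marketing) and corrplot(cor(marketing)) instead:

```r
# Correlation matrix for a numeric data frame, plus a multicollinearity check.
cm <- round(cor(mtcars[, c("mpg", "wt", "hp")]), 2)
cm                                       # pairwise correlation coefficients
# Flag predictor pairs whose correlation magnitude approaches 0.8:
preds <- cm[c("wt", "hp"), c("wt", "hp")]
any(abs(preds[upper.tri(preds)]) >= 0.8)
# The graphical version (requires the corrplot package):
# corrplot::corrplot(cor(mtcars[, c("mpg", "wt", "hp")]))
```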
Train&Test split: A model is a machine learning algorithm that has been trained to recognize certain types of patterns. In this case, we train a linear regression model over a set of data, and the data we have in this case is marketing. Well, in order to build a model, this data that you see here, marketing, must be divided into two sets. The first set is called the training set. The training set is a dataset of examples and is used to train the model. So simply put, the model is an algorithm that learns patterns from the data. Well, 80% of our data is used for the training data. Once the model is built, we evaluate the performance of the model using the test set. So this data here, marketing, we divide into two sets, a train set and a test set. Now let's build the linear regression model. Well, we set the seed for reproducibility. We set the seed because we are taking random samples; set.seed makes sure that you get exactly the same sequence of random numbers every time. Then we create a sample. A sample is a subset of a specified size taken from the data, and our data is marketing. The size of the sample is going to be 80 percent for the training dataset. First we declare the sample size: the sample size is 80 percent of the data, for the train data. Then we take the sample using the sample function. This function is used to take a random sample of a specified size from the data. So this is the data, marketing, and the size argument takes the sample size that we already declared before, which, as you see here, is 80 percent for the train data. Well, now we write the training data. The training data is about 80 percent of the data marketing. Now we write the test set.
The test set, and this is the training data. Highlight all of them and run this code. Well, we created the train data and the test set data. As we said, the train data is used to build the model, and once the model is built, we test the performance of the model using the test set data. Now let's see the train data. As you see, the train data is a sample of a specified size, which in this case is 80 percent. So these are random samples from the marketing dataset, and the size of this data is 80 percent. Let's see the number of rows in this dataset, train data. As you see, it's 160. Before, the size of the marketing data was 200, and the number of rows of the marketing data was 200. Now the train data is 160 rows, which is roughly 80 percent of the marketing data. And the test set is 40 rows, roughly equal to 20% of the marketing data. Well, we created the train data and test set data; we divided the marketing data into two sets, train and test. Now we are ready to build our first linear regression model. 30. Simple Linear Model: When we build, first of all, a simple linear regression model, we add only one variable, one predictor variable, in the model. Here is the simple linear regression model. model1 is the name of the model. This function, lm, is the linear model function. We pass the data with the data argument; in this case, the data is marketing. Well, the variable that we want to predict, sales in this case, serves as the output variable, the target variable that we want to predict. So we want to predict sales on the basis of youtube; youtube is the predictor variable. This is simple linear regression. Simple linear regression is a very straightforward approach for predicting a quantitative response. Sales is a quantitative response variable, and we predict sales on the basis of youtube. So in this case we have only one single predictor. Run this code. Now, to see the model, we write the summary function.
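The split and the first model can be sketched like this. It is a minimal sketch: the seed value (123) and the object names are assumptions, not necessarily what the instructor typed, and here the model is fit on the training data, which matches the stated purpose of the split.

```r
# Assumes marketing is loaded, e.g. data("marketing", package = "datarium").
set.seed(123)                                  # arbitrary seed for reproducibility
sample_size <- floor(0.8 * nrow(marketing))    # 80% of 200 rows = 160
train_index <- sample(seq_len(nrow(marketing)), size = sample_size)

train_data <- marketing[train_index, ]         # 160 rows used to fit the model
test_set   <- marketing[-train_index, ]        # 40 rows held out for evaluation

# Simple linear regression: one predictor, youtube
model1 <- lm(sales ~ youtube, data = train_data)
summary(model1)                                # coefficients, RSE, R-squared, F
```

nrow(train_data) and nrow(test_set) confirm the 160/40 split described in the lecture.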
And in that way you can actually understand the different parameters and the different evaluation metrics. Well, this is the summary of the model. Call shows the linear model function and the formula that we used: we regress sales on YouTube, and this is the data; the data name is marketing. Well, these are the residuals. Basically, a residual is the difference between y and y hat. And these are the coefficients. We have different coefficients, such as the intercept and the slope. The intercept and the slope are the model parameters, and their values are estimated using the best fitting line. The value of the intercept is 8.44 and the value of the slope is 0.048. Well, these two values are estimated using the least squares algorithm. So this function, the linear model function, computes the values of these two parameters. Well, how can we interpret this value, 8.44? Remember, in this case we are operating in units of $1000. So this means that for a YouTube advertising budget equal to 0, let's say in the absence of YouTube advertising, we can expect sales of 8,440, which is 8.44 times 1000. Remember, we are operating in units of thousands in this marketing data. Well, so this is how we can interpret the intercept. And this is the slope. The value of the slope is 0.048; this is the second parameter of the model. This means that for a YouTube advertising budget increase of 1000 dollars, we can expect an increase of 48 units in sales. We multiply this value by 1000 because we are operating in units of thousand dollars; multiplying it by 1000, we get 48. And this is the t value, the value of the t-test. Remember that the t-test is used to measure how many standard errors these coefficients are away from 0. The other thing that you need to remember is that every t-test has a p-value to go with it.
So a higher t value is associated with a lower p-value. Well, in this case the p-value is less than the threshold; I think you remember the threshold, which was 0.05. In this case, the p-value is less than the threshold, less than 0.05, which tells us that we reject the null hypothesis and accept the alternative hypothesis. Well, here is the residual standard error. The residual standard error is the average amount that the response will deviate from the true regression line. Well, we said that the lower the value of this metric, the better the model fits the data. In this case, we have 3.91, which is not that bad; it indicates that the model fits the data well. And here is the R-squared. R-squared provides an alternative measure of fit. It takes the form of a proportion: the proportion of variance explained. And we could also say that R-squared is a measure of the linear relationship between x and y, in this case between sales and YouTube. The value of R-squared in this case is 61%. I think you remember we said that the value of R-squared ranges from 0 to one. If the value is closer to 0, it indicates a weak relationship between x and y, and if the value is closer to one, it indicates a good relationship between x and y. In this case, it's 61%. The problem with the R-squared value is that it increases with the addition of independent variables, which can mislead the result. Well, suppose now we add another variable to the model; the R-squared will increase regardless of whether that variable contributes to the model or not. In order to overcome this problem, the adjusted R-squared is introduced. The adjusted R-squared adjusts for the number of parameters in the model, and its value only increases when the addition of the new parameters shows some improvement in the model fit. And this is the F statistic. The F statistic is used to evaluate the null hypothesis.
So remember, the value of the F statistic ranges from 0 to a larger number. Now, in this case we have 312, which is far, far away from 0, indicating that we can reject the null hypothesis. So the point here is: if we get a higher value of this metric, we can reject the null hypothesis. Well, now let's see how the model performs on the test data. Now we want to evaluate how the model performs, and we evaluate the performance of the model using the predict function. Well, now let's find out the value of the mean squared error. The mean squared error is a very important metric to evaluate so that we can make sure whether the model fits the data well. Using the predict function, we make predictions with the model on the test set. Then the mean squared error, or MSE: what is the mean squared error? Remember, the mean squared error is the average of the squares of the differences between the observed data, which in this case is the sales of the test set (sales, remember, is the target variable), and the prediction, which is the one that we already declared with the predict function; and this is the square. Run this line. The mean squared error of this model is 15. 31. Multiple Linear Regression: When we add now some more variables to the model, we are no longer talking about simple linear regression. It is multiple linear regression, because the model contains more than one variable; it is always going to be a multiple linear regression model. We want to regress sales onto all of the variables. The dot in the formula indicates that we want to regress sales onto all of the variables in the data. So all of them, newspaper, Facebook, and YouTube, all three variables we add to the model, and we see how the model performs, whether the model is better than the previous one. So let's run this code. Now we see the summary of the model, model2. Well, the R-squared now is 89%, higher than the previous one; the previous one was 61%.
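The evaluation and the second model can be sketched as follows. Object names such as predictions and mse1 are assumptions; the transcript does not show the exact code.

```r
# Sketch: test-set mean squared error of the simple model,
# assuming model1 and test_set exist as created earlier.
predictions <- predict(model1, newdata = test_set)
mse1 <- mean((test_set$sales - predictions)^2)
mse1                              # around 15 in the lecture

# Multiple linear regression: the dot means "all other variables"
model2 <- lm(sales ~ ., data = marketing)
summary(model2)                   # R-squared around 0.89 in the lecture
```

The dot shorthand expands to sales ~ youtube + facebook + newspaper for this dataset.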
Now we have 89%, and the adjusted R-squared is 89%, much better than the previous one; it was 60 percent. So this indicates that this model, model2, is much better than the previous one. The residual standard error now is about two, much better than the previous one; before it was 3.91, and now it is 2.023. Well, if you look at the model, newspaper doesn't have any relationship with the target variable. Its p-value, 0.86, is greater than the alpha, greater than the threshold, meaning that we accept the null hypothesis, which indicates that there is no relationship between newspaper and the target variable, sales. If you look at the correlation plot, this is true, because the correlation between newspaper and sales is just 0.23, which is very low. Well, in this case, we need to drop this variable, because it doesn't contribute to the model, and it is not essential to keep it in the model because the p-value is greater than 0.05. Now let's delete this variable. So we are going to name the third model model3 and add only youtube and facebook. Summary of the model. Well, as you see now, with only two variables, the model indeed improves. If you look at the adjusted R-squared, before it was 0.8956 and now it's 0.8962. Likewise, the residual standard error now is 2.018, and before it was 2.023. Of course, the model is better than the previous one. So the model improves when we delete the non-essential variable, which was newspaper; it doesn't have any relationship with the target variable. If you look at the F statistic, it is now 859, which is a very large number compared to the previous one, which was 570. The greater the value of this metric, the better the model fits the data, which also indicates that indeed there is a relationship between the target variable and the predictor variables. 32.
Diagnostic Plots: Well, now let's make a few diagnostic plots in R to check the linear relationship between sales and these two variables. Now we are going to make some diagnostic plots in order to check the linearity, the linear relationship between sales and the predictor variables. So we plot, we visualize, the model, and the model is the last one, model3. Well, let's make it a little bit bigger. Okay, here we have four plots. The first plot shows the residuals versus fitted values, and it's used for checking the linearity, the linear relationship between x and y, in this case between sales and the predictor variables, youtube and facebook. Well, if the linear model assumption is not satisfied, then the residuals display some shape or pattern. In contrast, if there is a linear relationship between the independent variables and the dependent variable, the scatter plot of residuals should not display any pattern, and the residuals must be equally spread. The residuals in this case are equally spread around y equals 0; there is no specific shape that you can see in this plot, meaning that indeed there is a linear relationship between sales and the predictor variables. Well, the second plot is called the quantile-quantile, or Q-Q, plot. This plot shows the theoretical quantiles and is used for checking the normality assumption: that the residuals are normally distributed. Well, if the residuals deviate from this dashed line, you see the dashed line here, then the normality assumption is violated. In contrast, if the residuals appear to be lying on the straight dashed line, then it indicates that the assumption holds and the model fits well.
Well, in this case, the residuals appear to be well aligned on the straight dashed line, which means that the normality assumption is satisfied. The third plot is the scale-location plot. This plot shows the square root of the standardized residuals versus the fitted values and is used for checking the homoscedasticity assumption. Well, the homoscedasticity assumption is that the variance of the error term remains the same across all values of the independent variables. Well, if the residuals are not randomly spread and the red line is not horizontal, then it's not good. In contrast, if the red line is horizontal with randomly spread data points, then it's good. In this case, we see randomly spread data points, which indicates that the assumption is satisfied. Well, the fourth plot shows the influential cases. An influential observation is an observation whose deletion from the dataset would noticeably change the shape of the regression line, meaning that if we removed these values from the data, then the regression line would look different. If any such cases are present in the data, they must be excluded from the data. Well, in this case, there is only one observation beyond the Cook's distance; you see it is observation 131. So in this case there is only one such observation, and I don't think this observation would change the regression line. Well, now let's find out the value of the mean squared error of this model. We make predictions for the third model: use the predict function with model3 and the test set. Well, the mean squared error for this model is 2.80, which is much better than the previous one. The first mean squared error we calculated was 15, and this one is about two. This indicates that the third model is the one that you can select based on this metric.
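A sketch of these two steps, the diagnostic plots and the test-set MSE of the third model. The object names (model3, pred3, mse3, test_set) are assumptions.

```r
# Sketch; assumes marketing and test_set exist as created earlier.
model3 <- lm(sales ~ youtube + facebook, data = marketing)

par(mfrow = c(2, 2))   # show all four diagnostic plots in one window
plot(model3)           # residuals vs fitted, Q-Q, scale-location, Cook's distance
par(mfrow = c(1, 1))   # restore the single-plot layout

pred3 <- predict(model3, newdata = test_set)
mse3 <- mean((test_set$sales - pred3)^2)
mse3                   # around 2.8 in the lecture, versus 15 for model1
```

Without the par(mfrow = c(2, 2)) call, plot() on an lm object shows the four diagnostic plots one at a time.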
This metric, the mean squared error, is a very important metric to evaluate, as it indicates the performance of the model. Well, we can say that this model performs better than the previous one, as the model improved: there is a great difference between this mean squared error and the previous one; it was 15, and now it is 2.8. Well, now, if you want to compare the predicted values and the target values, the observed values, let's combine them with cbind. We are going to combine together the test set and the predictions, the predictions of model3. So the first column is the ID column, this one is the prediction, and this one is the actual sales; sales is the observed data and this is the predicted data. Then you can compare: the actual value and the predicted value, you see, are a little bit different. So the mean squared error is the average of the squared differences between these two values. The gap between the predicted values and the target values is very, very small, and you can see that with the mean squared error in this case: it's a very good model, it performs better than the other one. Well, this is how we can evaluate the mean squared error and see exactly how the model performs on unseen data, which in this case is the test set data. Well, now let's see how we can make a very important diagnostic plot. This plot compares the predicted values versus the actual output. The first plot call is for sales, the target variable, the observed data, and this one, the lines call, this code you see here, is for the prediction. So we compare the predicted values versus the actual output. Let's see in a second how the plot will look. So highlight it and run, make it bigger. Well, this plot shows the comparison between the predicted values and the actual output. So this plot shows the target variable versus the predicted outcome.
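The comparison table and the actual-versus-predicted plot can be sketched as below. The object names (pred3, test_set) and the legend are assumptions; the transcript does not state the exact plotting arguments.

```r
# Sketch; assumes test_set and pred3 (predictions of model3) exist.
head(cbind(test_set, prediction = pred3))          # actual vs predicted, side by side

plot(test_set$sales, type = "l", col = "red",
     xlab = "Test observation", ylab = "Sales")    # actual values in red
lines(pred3, col = "blue")                          # predicted values in blue
legend("topleft", legend = c("actual", "predicted"),
       col = c("red", "blue"), lty = 1)
```

When the two lines mostly overlap, as described in the lecture, the model is tracking the held-out data closely.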
It seems that in most of the cases these two are overlapping with each other, except in a few cases. The blue line is the predicted values and the red line is for sales, the actual output variable, the target variable, I mean. So these two lines are overlapping with each other, except in a few cases. In this case, it seems that most of the predicted values are closely following the actual values, indicating a very good model. 33. Scatter Plot: For instance, let's say we want to plot the relationship between sales and youtube. So we are going to make a scatter plot, a visualization which shows the strength of the relationship between two variables. This plot shows the relationship between sales and youtube. This function, attach, attaches the data so that we can access the variables in the data by name. Let's change this: make sales y, and x is youtube. The first variable is always the predictor variable, and the second variable is the target variable. So youtube represents x, the predictor variable, and sales is the target variable, which represents y. Let's run this code. I'm going to make it a little bit bigger. And this is the relationship between sales and YouTube; you see the linear trend that exists between these two variables. As the YouTube ad budget increases, the sales increase: the more money we spend on YouTube advertising, the more sales we make. So this is how we can interpret the relationship between sales and YouTube. Right now I'm going to add the linear regression line inside this plot. It's going to be lines... let's pass the first model that we created, the color of the regression line, say blue, and the thickness, let's say two; this argument is for the thickness of the line. Well, we got an error. Okay, it's not lines; the function that we use is not lines, we use abline. Now let's run this code. Well, you see the blue line, which is the regression line, the linear regression line.
I think now you understand how to plot, how to make a visualization of the model itself using abline, and as well how to plot the relationship between two variables, the dependent variable and the independent variables. Well, you also understand the different metrics to evaluate so that you can select the best model. And this was all about linear model practice. Now you can make your own linear regression model; you can use different datasets for predicting different variables.
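The final lesson's plot can be sketched as follows, assuming marketing and model1 exist as created earlier; the axis labels are assumptions.

```r
# Sketch: scatter plot of sales vs youtube with the fitted regression line.
attach(marketing)                      # so sales and youtube can be used by name
plot(youtube, sales,
     xlab = "YouTube advertising budget", ylab = "Sales")
abline(model1, col = "blue", lwd = 2)  # regression line; lwd sets the thickness
detach(marketing)
```

Note that abline() can draw the line directly from a fitted lm object only for simple regression, since it needs exactly one intercept and one slope; this is also why lines() fails here, as it expects x-y coordinates rather than a model.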