Statistics for Data Analysis using R Programming | Venkat Murugan | Skillshare

Statistics for Data Analysis using R Programming

Venkat Murugan, Data Scientist


Lessons in This Class

12 Lessons (1h 4m)
    • 1. Introduction

    • 2. Mean and Median

    • 3. Standard Deviation and Variance

    • 4. Range

    • 5. Skewness and Kurtosis

    • 6. Normality test

    • 7. Sorting of Variables

    • 8. One sample t test

    • 9. Correlation

    • 10. Detecting Outliers

    • 11. Frequency Table

    • 12. Quartile






About This Class

This class will take you from a complete beginner in programming with R to a professional who can complete data manipulation on demand. It gives you the complete skill set to tackle a new data science project with confidence and to critically assess your own work and that of others. R is one of the top languages to get you where you want to be. Combine that with statistical know-how, and you will be well on your way to your dream title.

I will take you through descriptive statistics and the fundamentals of inferential statistics.  

We will do it in a step-by-step manner, incrementally building up your theoretical knowledge and practical skills. We will also look into finding correlation, skewness, kurtosis, outliers, and range, and into performing a normality test, a t-test, and sorting of variables.

Meet Your Teacher


Venkat Murugan

Data Scientist


Hello, I'm Venkat.




1. Introduction: Hello and welcome to this class, an introduction to statistics with R programming. In this class I'm going to cover all the topics theoretically, and I'm also going to give a clear-cut demonstration in RStudio of all the functions that I'm going to use, as well as all the packages that need to be installed.

I'm going to cover the measures of central tendency, which are the mean and median, and how you can calculate them. Then you'll see how you can calculate the measures of dispersion, which are the range, standard deviation, and variance. Then you'll see how you can calculate the normal distribution parameters, that is, how you can calculate skewness and kurtosis. Then you're going to see how you can check the normality and outlier assumptions. These assumptions are very important before performing any analysis or hypothesis test, such as a one-sample t-test, a two-sample t-test, or a paired t-test. Then you'll see how you can sort data based on one or more variables. Then you're going to see how you can perform the one-sample t-test and, before performing it, how you can check assumptions like normality and outliers.

Then you'll see how you can calculate the correlation coefficient between two numeric variables. Correlation is nothing but the bivariate relationship between two numeric variables. You're going to see the numerical method in RStudio, as well as a visual representation of the strength of association between two numeric variables using a scatter plot. Then you're going to see how you can build a frequency table for a categorical variable using a package. And finally, you're going to see how to calculate the quartile values, with a numerical method in RStudio as well as a visual method using the box plot. So let's get started, and I'll see you in the class.

2. Mean and Median: Hello and welcome back. Now let's see how we can calculate the mean and median using R programming. To demonstrate this, I'm going to use R's default dataset, which is named iris. Let's see what this data contains. We use the View function followed by the name of the dataset, and we press Ctrl+Enter. As you can see, the data is in front of us. It contains five variables. Most of the variables hold numeric values, but the variable Species holds categorical values.

Now let's get back to the code and see what the mean of the column, or variable, Sepal.Length is. We just have to use the mean function, which is pretty straightforward, followed by the name of the dataset and a dollar sign, which lets us access a particular variable or column by name. So this is the argument, and mean is the function name. Let's run it, and we have the mean value over here. Similarly, we can use the median function. The argument is the same: the dataset name followed by a dollar sign, which helps us access the particular column or variable. Press Ctrl+Enter again, and we have 5.8 as the median.

Similarly, we can look at the maximum and minimum value of a particular column or variable. The function name is again pretty straightforward: max, for the maximum value. If I run this, 7.9 is the maximum value of the variable Sepal.Length. Similarly, if I want to see the minimum value, the function name is min and the argument is the same. Ctrl+Enter, and we can see 4.3 as the minimum value.

3. Standard Deviation and Variance: Now let's see the other measures of dispersion, which are standard deviation and variance. The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean, and it is calculated as the square root of the variance.
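The relationship between the two measures can be checked directly. This is a minimal sketch, assuming R's built-in iris data and the Sepal.Length column used throughout the class; the variable names x, v, and s are illustrative only.

```r
# Sepal.Length column of R's built-in iris data
x <- iris$Sepal.Length

v <- var(x)  # sample variance (divides by n - 1)
s <- sd(x)   # sample standard deviation; the function name is lowercase

print(v)
print(s)

# The standard deviation is the square root of the variance
print(all.equal(s, sqrt(v)))
```

Both var and sd use the sample (n - 1) denominator, so the two values agree with each other without any adjustment.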
If the data points lie further from the mean, the standard deviation is higher: the more spread out the data, the higher the standard deviation. So how can we calculate the variance and standard deviation using R programming? It's very simple. Let's use the same dataset, iris, and find the variance. The function name is var, and the parameters are the same as before: I specify the dataset and, say, Sepal.Length. Ctrl+Enter, and this is the variance. Similarly, for standard deviation the function name is sd, and the parameters are the same, so I can quickly copy them. If I press Ctrl+Enter, there's an error: sd has to be written in lowercase letters, not capitals. Now you can see the standard deviation is 0.8280661.

The variance is nothing but the square of the standard deviation, or equivalently, the standard deviation is the square root of the variance. So if I have to calculate it that way, I can use the square root function, sqrt, and pass it the variance. Ctrl+Enter, and this is the standard deviation. So this is another way to find the standard deviation, using the value of the variance.

The formula for standard deviation is pretty straightforward. It shows how much the observations differ, or deviate, from the mean. Take each observation, say x_i, subtract the mean, and square the difference. Add up those squared differences and divide by the total number of observations N if you're talking about a population, or by n minus 1 if you're talking about a sample. Then take the square root. That is the formula for standard deviation.

4. Range: In the previous video, you saw how you can calculate the mean and median of a variable.
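A minimal sketch of those central tendency calls, together with the range function that this lesson builds on, assuming the built-in iris data; the variable names are illustrative only.

```r
x <- iris$Sepal.Length

x_mean   <- mean(x)
x_median <- median(x)  # 5.8
x_max    <- max(x)     # 7.9
x_min    <- min(x)     # 4.3

print(c(mean = x_mean, median = x_median, max = x_max, min = x_min))

# range() returns the minimum and maximum, not their difference
r <- range(x)
print(r)

# Subtract to get the range as a single number
x_range <- r[2] - r[1]
print(x_range)
```

Base R's diff(range(x)) gives the same single number in one call.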
In statistics, a measure of central tendency gives a single value that represents the whole variable: the mean gives the average value of the entire variable, and the median is the middle-most value of that variable. In this video, we're going to see some measures of dispersion. Measures of dispersion help us study the variability of the items or observations. In a statistical sense, dispersion has two meanings: first, it measures the variation around the average, and second, it measures the variation of the observations among themselves. There are a number of measures of dispersion that we're going to see. The first one is the range. Range is the simplest measure of dispersion, defined as the difference between the largest and the smallest value.

To demonstrate this, I'm going to take the same dataset that you saw in the previous video. If I quickly show you again, this is the same dataset, with five variables. To get the range, which is nothing but the difference between the maximum and minimum value of a variable, we use the range function. It's quite straightforward, and the parameter is the same as we've already seen: iris is the name of the dataset, and we use the dollar sign to access the particular variable, in this case Sepal.Length. If I press Ctrl+Enter, you can see two values, 4.3 and 7.9. You've already seen these two values in the previous video: they are the minimum and the maximum. The range is the difference between them. The range function only returns the minimum and maximum for that variable, so to get the range we have to subtract: 7.9 minus 4.3, so 3.6 is the range.

5. Skewness and Kurtosis: Now let's see how we can find the skewness and kurtosis in your data. Skewness is a measure of symmetry or, more precisely, of the lack of symmetry. For a normal distribution the skewness is always zero: the data is distributed symmetrically and takes a bell shape when we draw the distribution of the dataset. Skewness can be negative, positive, or zero. Like I said, zero skewness means the data is symmetric, as in a normal distribution. Otherwise the data can be positively skewed or negatively skewed. When you see that data is positively skewed, it means the data is right-skewed: the tail is on the right side, and there are more extreme positive values that deviate from the mean.

Similarly, kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to the normal distribution. Skewness is the measure of symmetry, whereas kurtosis is a measure of how peaked the distribution of the data is. If the distribution of the data follows a normal distribution, it is called mesokurtic. When the distribution has a large positive excess kurtosis, it is called leptokurtic. Similarly, a distribution with a negative excess kurtosis value is called platykurtic.

To demonstrate how to find the skewness and kurtosis in our data, I'll use the same dataset, iris. But before finding the skewness and kurtosis, we need to install a package. To install the package we use the function install.packages, and the name of the package is moments. I'm not going to press Ctrl+Enter here, since this package is already installed in my RStudio.
If it is already installed in RStudio on your system, that's fine: you don't have to run this function to install the package, and you can use it straight away. But if the package is not installed in your RStudio, you need to install it using this function. Since it is already installed in my RStudio, I'm not going to run this function. Instead I'll use another function to activate it. Before using moments, I have to activate it, and to do that we use the library function with the name of the package. If I press Ctrl+Enter, there is no error message, so it is loaded in the background and we can now use the package. But make sure it is installed on your system, because the concepts that follow are based on this package; otherwise it would be difficult for you to follow.

To find the skewness, we call the function skewness on the dataset column we've already seen, iris$Sepal.Length. Ctrl+Enter. This is the value of skewness: 0.311753. This positive value indicates that the data for the variable Sepal.Length is positively skewed. Note that it is more than zero, so it is positively skewed, which means it is right-skewed.

Now let's see the kurtosis. The argument is going to be the same, the iris dataset column. Ctrl+Enter. This is the value of kurtosis. Note that the kurtosis function in moments does not report excess kurtosis: a normal distribution scores 3 on this scale, so a value above 3 indicates a leptokurtic distribution and a value below 3 a platykurtic one. As I said earlier, mesokurtic is the normal case, when the distribution follows a bell curve.
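The two statistics can also be sketched in base R, without installing anything. This assumes the standard moment-based definitions (third and fourth central moments divided by powers of the second), which is how the moments package is understood to compute them; the variable names are illustrative only.

```r
x <- iris$Sepal.Length
n <- length(x)
d <- x - mean(x)   # deviations from the mean

# Central moments (dividing by n, not n - 1)
m2 <- sum(d^2) / n
m3 <- sum(d^3) / n
m4 <- sum(d^4) / n

skew <- m3 / m2^1.5   # 0 for a symmetric distribution
kurt <- m4 / m2^2     # 3 for a normal distribution
excess_kurt <- kurt - 3

print(skew)
print(kurt)
print(excess_kurt)
```

A positive skew value points to a right-skewed variable, and the sign of excess_kurt separates leptokurtic (positive) from platykurtic (negative) distributions.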
When the excess kurtosis is a large positive number, the distribution is leptokurtic. Here the value is about 2.43, which is below the normal distribution's 3, so the excess kurtosis is negative, and in this case the distribution is called platykurtic. So this is how you can find the skewness and kurtosis, which are very important aspects of the distribution of your data.

6. Normality test: Now let's see how we can do the normality test. It is essential whenever you're doing any hypothesis testing, like a one-sample t-test, a two-sample t-test, or a paired t-test. There are basic assumptions that need to be met for the results of the t-test, or of any hypothesis test, to be valid. One of them is the normality assumption: whether the variable is normally distributed, that is, whether it follows a normal or bell-curve distribution. To check whether it is normally distributed, we will perform the Shapiro-Wilk test, a normality test, in R programming, just to show you how you can check the assumptions before actually performing the t-test. As you know, the t-test is a test around the mean of the sample. For the demonstration, we will check whether the variable Sepal.Length is normally distributed. It is again the same dataset, or data frame, which is iris. So let's jump right in.

Just another piece of information: whenever you start RStudio, the stats package gets loaded by default. If you want to check whether it is there in your RStudio, go to the Packages tab right over here, follow the cursor, and type in the search box. Whenever you open the Packages tab, you will find a lot of packages available for you to use.
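A minimal sketch of the normality check described here, assuming the shapiro.test function from the stats package and a significance level of 0.05; the names res, alpha, and reject_normality are illustrative only.

```r
# Shapiro-Wilk test; the null hypothesis is that the data are normally distributed
res <- shapiro.test(iris$Sepal.Length)
print(res)

alpha <- 0.05  # assumed significance level
reject_normality <- res$p.value < alpha
print(reject_normality)
```

When reject_normality is TRUE, the null hypothesis of normality is rejected at the chosen significance level.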
Just type stats and you will see that stats is already present in the system. So before conducting the test, just check. Since the stats package is already installed, let's go right ahead. The function name is shapiro.test, and the parameter is iris with Sepal.Length, since that is the variable we're going to use. If I run it, these are the values. For the Shapiro-Wilk normality test, as you can see, the p-value is 0.01. Why is the p-value important? Because we compare the p-value with alpha, which is nothing but your significance level. By default it is 0.05, though sometimes 0.01 is used as well. Since the p-value is 0.01, it is clearly less than alpha if we take alpha as 0.05, that is, 5%. Since p is less than alpha, by the rule of hypothesis testing we reject the null hypothesis, and the null hypothesis here is that the variable is normally distributed. So we conclude that the variable we're testing for the normality assumption is not normally distributed. If you reject the null hypothesis, it is clear that the variable is not normally distributed, so it is quite straightforward. But it is very important to check the assumptions before going for any hypothesis test. Mostly you have to check for outliers and you have to check for normality; those are the two important assumptions. So this is how you can check for normality using the Shapiro-Wilk normality test.

7. Sorting of Variables: Now let's see how we can sort variables in R programming. Let's quickly have a look at the data frame again. We're using the same data frame, called iris.
If I just show the data frame, this is the same data frame we've been working with. Suppose I want to use Sepal.Length and Sepal.Width, these two variables, as parameters to sort the values. Let's go one by one. Suppose I want to sort the variable Sepal.Length in ascending order. How can we do that? For that I have to create a new R variable called iris1, or you can name it anything; just for continuity I'm naming it iris1. Then comes the name of your original data frame, and the argument has to be in square brackets. For sorting, we use a function called order. In order, you again specify the name of the original data frame and the variable on which you're sorting the data. The sorting variable here is Sepal.Length, and then you have to put a comma. This comma before the closing square bracket is very, very important, because it is how we tell R that we want to retain all the variables of the data frame iris. If I run this, iris1 is created. As you can see, the variable Sepal.Length has been sorted in ascending order. If I show you the data frame iris side by side: this is the main data frame with Sepal.Length, and this is the sorted one. As you can see, it has been sorted in ascending order.

Now let's see how you can sort the same variable in descending order. You could take any other variable, but let's stay with Sepal.Length. Let's create a new variable called iris2. Things are the same again: square brackets again, and the order function again. To sort in descending order there is just one small difference: we use a minus sign. That is the only difference from the previous code.
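The ascending and descending sorts described above can be sketched as follows, assuming the built-in iris data and the data frame names iris1 and iris2 used in the lesson.

```r
# Ascending sort on Sepal.Length; the trailing comma keeps every column
iris1 <- iris[order(iris$Sepal.Length), ]

# Descending sort: the same call with a minus sign on the sort key
iris2 <- iris[order(-iris$Sepal.Length), ]

print(head(iris1))
print(head(iris2))
```

The minus sign works because Sepal.Length is numeric; order also accepts decreasing = TRUE, which is the usual choice for non-numeric keys.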
Again, the argument is the same: iris, dollar sign, Sepal.Length, and again a comma. Now if I run this, you have a data frame called iris2. Here you can see all three data frames opened one by one, and now Sepal.Length has been sorted in descending order: the maximum number is at the top, and it decreases as we go down. Let's close these three.

Now let's sort our data on two parameters. Suppose I want to sort our data by ascending Sepal.Length and ascending Sepal.Width. The new data frame is named iris3. It's the same again: we use the function called order. Now you have to specify two parameters. The first parameter, let's say, is Sepal.Length, and the second parameter is Sepal.Width. Finally, add the comma like we've seen previously. Ctrl+Enter. Here we have it: it has been sorted by ascending Sepal.Length as well as ascending Sepal.Width, as you can see.

Now let's change the parameters and mix the directions. I'll quickly change the code right here, creating a new data frame called iris4, and let's say I want descending Sepal.Length and ascending Sepal.Width. That is the only change you have to make. Ctrl+Enter. Here we have it: you get descending Sepal.Length, with the maximum value at the top, and ascending Sepal.Width. So this is how you can sort your numeric variables using R programming.

8. One sample t test: Now let's see the concept of the one-sample t-test. Before conducting a one-sample t-test, it is necessary to know the assumptions behind it.
There are two basic assumptions. First, the variable under study is normally distributed: the variable we're testing against the population mean, or any hypothesized mean, has to be normally distributed; otherwise you cannot go ahead with the one-sample t-test. Second, the variable's values should be free from outliers, meaning the variable does not contain any significant outliers.

To demonstrate the one-sample t-test, I'm going to use the same dataset, or data frame, which is iris. Let's quickly see the dataset. I'm going to use the variable Sepal.Width and check, using the one-sample t-test, whether the mean Sepal.Width is significantly different from 4. You have seen these values. The number 4 is a hypothesized value; you can put in any number you want to check against. Basically, we're comparing the mean of our sample to a null value. The null value is 4 here, and you can choose any value, but the null value is usually the population mean.

To go ahead with this one-sample t-test, you use the function called t.test. But before using the function, check whether its package is available on your system. Follow my cursor, click on Packages, and you will see the list of packages. Type stats and see whether this package is installed on your system.
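The test set up above, whether the mean Sepal.Width differs from 4, can be sketched as follows. This assumes the t.test function from the stats package with a two-sided alternative and a significance level of 0.05; the names res and alpha are illustrative only.

```r
# One-sample, two-sided t-test of mean Sepal.Width against the null value 4
res <- t.test(iris$Sepal.Width, alternative = "two.sided", mu = 4)
print(res)

alpha <- 0.05  # assumed significance level
print(res$p.value < alpha)  # TRUE means the null hypothesis is rejected

print(res$estimate)  # sample mean
print(res$conf.int)  # 95% confidence interval for the mean
```

The printed result also states the alternative hypothesis, which for this call is that the true mean is not equal to 4.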
If a package is not installed, you install it by writing the function install.packages with the package name in double quotes and running it, and it will automatically install the package for you. The stats package itself ships with R and is loaded by default, so it should already be there. Now let's get back to the code. The function is t.test, and the parameters are iris and Sepal.Width, since I'm using Sepal.Width to check whether it is significantly different from 4, and it is a two-sided t-test. These parameters are necessary when you do this one-sample t-test: you mention alternative equal to two.sided, comma, mu equal to 4. Here mu is nothing but your hypothesized population mean, the null value. If I run this, you will see the values over here. There are a bunch of values, so let's quickly browse through them. As you can see, you have the p-value over here, and the alternative hypothesis is that the true mean is not equal to 4. What does this mean? The p-value is significantly lower than alpha. The default alpha, the significance level, is 0.05, or 5%. Since the p-value is less than alpha, by the general rule of hypothesis testing we reject the null hypothesis. Rejecting the null hypothesis means accepting the alternative hypothesis, which in our case is that the true mean is not equal to 4. Then you have the confidence interval values, and this value represents the sample mean, which is 3.057. So this is how you do the one-sample t-test in R programming.

9. Correlation: Hello. Now let's see how we can calculate the correlation between two variables using R programming. So what is correlation?
Correlation is a bivariate analysis that measures the strength of association between two variables, or rather two numeric variables. It also tells the direction of the relationship. In terms of the strength of the relationship, the value of the correlation coefficient varies between plus one and minus one. A value of plus one indicates a perfect degree of association between the two numeric variables. Similarly, a value of minus one indicates a perfect degree of association in the opposite direction: as one variable moves in one direction, the other variable moves in the other direction. We have talked about plus one and minus one, but as the correlation coefficient goes towards zero, the relationship between the two variables gets weaker, and a coefficient of zero means there is no linear relationship at all. The direction of the relationship is indicated by the sign of the coefficient: a positive sign indicates a positive relationship and a negative sign indicates a negative relationship. Usually in statistics we measure four types of correlation: Pearson correlation, Kendall rank correlation, Spearman correlation, and point-biserial correlation.

Now let's see how we can calculate the correlation in RStudio. To demonstrate that, I'm going to use the same dataset we have seen so far, which is iris. Let's see the dataset once again, just for continuity. Okay, so now we can see the dataset. It has five variables and 150 observations: four numeric variables and one categorical variable. Let's get back to the code and try to find the correlations. To find the correlation, we first have to install a package. In the last video we saw how to do that.
Again I'm going to use the same function, install.packages, and the name of the package is psych. Since this package is already present on my system, I'm not going to run this function. But as far as following the example is concerned, you have to install this package first, before using the functions that find the correlation. So I suggest you install it: just type install.packages with the package name psych and press Ctrl+Enter, and that will install the package for you. The next step is to activate the package, for which you use another function, called library. So library, with the name of the package, and Ctrl+Enter. As you can see, there is no error, which means everything is fine and it is loaded in the background.

Now let's get back to business. What is the function to find the correlation? It is corr.test. Suppose I want to find the correlation between two variables, Sepal.Length and Sepal.Width. So corr.test is the name of the function, and the parameters are iris$Sepal.Length, a comma, and then the other variable, iris$Sepal.Width. If I press Ctrl+Enter, you have the correlation coefficient in front of you: -0.12. Like I already mentioned, it is close to zero; it is negative, but close to zero. That means there is practically no correlation between these two variables. The sample size is 150.
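A minimal sketch of the same check that runs without installing anything. Note the swap: it uses cor.test from the stats package rather than the corr.test function from psych shown in the lesson, and it assumes a significance level of 0.05; the names res, r, p, and alpha are illustrative only.

```r
# Pearson correlation test between the two sepal measurements
# (cor.test from stats, used here in place of psych's corr.test)
res <- cor.test(iris$Sepal.Length, iris$Sepal.Width)

r <- unname(res$estimate)  # correlation coefficient
p <- res$p.value           # probability value

print(round(r, 2))
print(round(p, 2))

alpha <- 0.05  # assumed significance level
print(p < alpha)  # FALSE means the null hypothesis of no correlation is not rejected
```

Passing method = "kendall" or method = "spearman" to cor.test gives the rank-based coefficients mentioned earlier in the lesson.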
The probability value, the p-value, is 0.15, which is greater than alpha, the significance level, which is 0.05. What does that mean? Since the p-value is not less than alpha, there is no statistically significant association between these two variables. This is another way to verify whether any correlation actually exists. The null hypothesis is that there is no relationship between the two numeric variables, and the alternative hypothesis is that there is a relationship; we would reject the null hypothesis when the p-value is less than alpha. Since the p-value here is greater than alpha, we fail to reject the null hypothesis, which means we have no evidence of a relationship between the two variables. So it is another method to confirm your hypothesis.

Now let's see a scatter plot, just to confirm visually what you got from the correlation figures. To do that, we use a function called plot. The parameters are the same as we have already seen. Ctrl+Enter, and here we have it. As you can see, it is quite a clear picture of the two variables: the data points are scattered all over the place, and there is no visible relationship between the two variables. It is neither clearly positive nor clearly negative; the correlation coefficient is almost zero. So this is another way to check visually whether a relationship exists between your two variables.

10. Detecting Outliers: The hypothesis tests we conduct around the mean are generally t-tests: a one-sample t-test, a two-sample t-test, or a paired t-test.
All of these tests are done on the mean, so there are basically two very important assumptions that have to be fulfilled for the results of the test to be valid. One of them is the normality assumption; in the previous video you saw how you can check it. The second is checking for outliers in the particular variable of interest. This is very important: before conducting any hypothesis test around the mean, we must check these assumptions. In this video, you're going to see how to detect outliers in a data set or in a variable with the help of standardized values. So let's compute them. This is how you can put a comment in RStudio: let's compute the standardized values of the variable. How can we do that? First, let's quickly look at the data frame again. This is the data frame, with 150 observations and five variables. Let's see whether there is any outlier present in the variable Sepal.Width. For that, I have to create a new variable; you can name it anything. To get the standardized values for that variable, we have to use the function called scale. The parameters are the name of the data frame, a dollar sign, and the variable we want to check, Sepal.Width, plus an additional parameter, scale = TRUE. It is very important that you always keep this parameter as TRUE to get the standardized values. If I run this, the new variable is created, and it holds our standardized values. Let's quickly check it again. It has 150 observations, as you already know.
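The standardization step above can be sketched roughly as follows. The variable name z is hypothetical, since the lesson says any name will do, and flattening the result with as.vector is my own addition to make the later steps simpler.

```r
# Standardize Sepal.Width: (value - mean) / standard deviation
z <- scale(iris$Sepal.Width, center = TRUE, scale = TRUE)

# scale() returns a one-column matrix; flatten it to a plain numeric vector
z <- as.vector(z)

length(z)  # one standardized value per observation
mean(z)    # approximately 0 after centering
sd(z)      # 1 after scaling
```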
Okay, so now let's get back to the code and do a sort. I will tell you why we want to sort the data. Again, let's put a comment here: sort the vector in decreasing order. The new variable is basically a vector, and we are sorting it in decreasing order because we want a vector in which the values run from the highest to the lowest. In the previous video, when we covered the normality test, the Shapiro-Wilk test, you saw that one of the outputs gave a 95% confidence interval with two values. Those are the two values you have to check against after sorting this vector: any value lower than the lower value of that confidence interval, and any value higher than the higher value. Suppose, just as an example, that two is the lower value and three is the higher one. Then, after sorting, you check whether there are any values in the vector lower than two or higher than three. Those are your outliers, because those values fall outside your confidence interval. Now, how can you sort? We have to use the sort function. The first parameter is your newly created vector, and the order is given by decreasing = TRUE. Let's quickly run it. There is some error, so let me just change it. Okay, you get a vector here, as you can see, and this is a better view, I think. So the code is again: the function sort, the name of the vector that you created previously using the scale function, and then decreasing = TRUE, with TRUE in capital letters, or just a capital T.
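A minimal sketch of the sorting step, under some assumptions: the names z, z_sorted, lower, upper, and outliers are hypothetical, and the cutoffs of -3 and 3 are a common rule of thumb for standardized values that I have swapped in for illustration. They are not the confidence-interval values from the lesson.

```r
# Standardized values of Sepal.Width as a plain vector
z <- as.vector(scale(iris$Sepal.Width))

# Sort from the highest value to the lowest
z_sorted <- sort(z, decreasing = TRUE)

head(z_sorted)  # largest standardized values, checked against the upper cutoff
tail(z_sorted)  # smallest standardized values, checked against the lower cutoff

# Hypothetical cutoffs (assumption: a +/- 3 rule of thumb, not the lesson's values)
lower <- -3
upper <- 3

# Values that fall outside the cutoffs are flagged as candidate outliers
outliers <- z_sorted[z_sorted < lower | z_sorted > upper]
print(outliers)
```

Replace lower and upper with whatever bounds you are checking against; the sorting and filtering stay the same.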
Okay, so this is the vector you will get, and you can check it starting from the topmost row. All the values that are higher than the upper value you got from the normality test are your outliers. Similarly, if I scroll down, any value in this vector that lies below the lowest value of the confidence interval is also an outlier. This is a very easy way to check for, or just visualize, the outliers in the variable, which is part of the assumptions for the other kinds of tests: the t-tests, whether one-sample, two-sample, or paired, and the F-test. So this is a routine: you have to check for normality as well as for outliers.
11. Frequency Table: Now let's see how we can build a frequency table for a categorical variable using a package called plyr. As you all know, before using any package we have to install it. On my system it is already installed, so I'm not going to run this part, which is install.packages. But if it is not on your system, in your RStudio, please check the package list here; you can search for it, and if it is not there, just install it. You already know the name of the package: it is plyr. Once you have installed the package, the next step, as you already know, is to activate it with the library function and the name of the package. If I run it, there is no error, which means it is running in the background. Now let's start the process of creating the frequency table. In this example, I'm again using the same data set that we have been using so far, which is the iris data set. In the iris data set, we have a categorical variable called Species, and it has just three categories.
The first one is setosa, as you can see here, the second one is versicolor, and the third one is virginica. These are the three categories of the categorical variable Species, so let's create our frequency table for this variable. How can we do that? The first step of building the frequency table, and let's put a comment for it, is to build the initial table with the absolute frequencies, which are nothing but simple counts. Let's store those values in a variable called abc, and we have to use the function called count. In count, the name of the data frame is the first parameter, and then, in single quotes, we have to specify the name of the variable; in this case it is Species. If I run this and print the result, print(abc), Control+Enter, here you can see the absolute frequency table, with the variable Species and the frequency for each level. There are 150 observations, as you know, so for each level, for each species, you have 50 observations. Now, the second part is to get the relative frequencies. This is the absolute frequency, so how can we calculate the relative frequency? Let's put a comment: compute the percentages, which are nothing but your relative frequencies. Let's name our vector prc, for percentage. Take the first table that you created, which has two columns, one Species and one freq. You have to access the freq column and divide it by nrow, which is nothing but the number of rows.
You can also call it the number of cases in your original data frame. So nrow gives the number of rows, and its parameter is the name of your original data frame, which in our case is iris. Control+Enter, and let's print the result. These are the relative frequencies. As you know from the absolute frequencies, there are 50 observations for each level out of 150, so 0.33 is the relative frequency for each level, for each kind of species in our data frame. Now for the third step. The third component of the frequency table is the cumulative counts. To compute the cumulative counts, I'm again creating a vector, and I'm using the function cumsum, which is a very general function of base R. The parameter is the freq column of your first table. This will give you the cumulative counts. If I run it, Control+Enter, and print it, these are the cumulative counts: 50, 100, 150. Then what is the fourth component? You have to get the cumulative percentages. How can we do that? Let's create another vector for the cumulative percentages, using the previously generated cumulative counts divided by your number of rows, which is nrow of the original data frame, iris. If I run this and print the result, Control+Enter, these are the desired cumulative percentages. The last and final step is to add all these components, the cumulative counts and the percentages, to the initial table, which is nothing but the table named abc. How can we do that? It's quite simple.
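The components built so far, and the combined table they lead to, can be sketched in base R as follows. This is only an illustration under assumptions: table with as.data.frame is my stand-in for the count function from plyr so that the example runs without extra packages, and the names cumul, cumul_prc, and freq_table are hypothetical (abc and prc follow the lesson).

```r
# Step 1: absolute frequencies (base R stand-in for plyr's count(iris, 'Species'))
abc <- as.data.frame(table(Species = iris$Species), responseName = "freq")

# Step 2: relative frequencies = counts / number of rows in the data frame
prc <- abc$freq / nrow(iris)

# Step 3: cumulative counts
cumul <- cumsum(abc$freq)

# Step 4: cumulative percentages
cumul_prc <- cumul / nrow(iris)

# Final step: bind the new columns onto the initial table
freq_table <- cbind(abc, cumul, prc, cumul_prc)
print(freq_table)
```

Because each species has 50 of the 150 rows, the cumulative counts run 50, 100, 150 and the cumulative percentages end at 1.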
Okay, I'm using the function cbind, which is nothing but column bind, and the contents are just the four elements that you have generated so far: abc, the cumulative counts, prc, and the cumulative percentages. If I run this and print the final result, here we have it. This is your final frequency table, where you have the name of the variable, Species, the frequency, which is nothing but your absolute counts, the cumulative frequency, then the relative frequency, and then the cumulative percentages. So this is your entire frequency table, and this is how you can generate a frequency table for a categorical variable.
12. Quartile: Now let's see how we can calculate quartiles for a data set. A quartile is a statistical term describing a division of a data set into four defined intervals. The first quartile, which is denoted as Q1, is defined as the middle number between the smallest number in your data set and the median of the data set. The second quartile, which is represented as Q2, is the median of the data set, which is nothing but the middlemost value. Similarly, the third quartile, Q3, is the middle value between the median and the highest value of the data set. Now let's see how we can find the quartiles using R programming. To find the quartiles, we have to use a function called summary, which is a little different from what we have seen so far. The parameters are going to be the same: I'm using the same data set, iris, and suppose I'm using the variable Sepal.Length. Control+Enter. As you can see here, there is a set of numbers, which are nothing but your minimum, first quartile, median, mean, third quartile, and maximum values. We have already seen all of these values in some form in the earlier videos.
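As a short sketch of the summary call just described, the following also uses the base R function quantile to pull out the three quartiles on their own. The quantile call is my addition, and the names s and q are only for illustration.

```r
# Six-number summary: minimum, Q1, median, mean, Q3, maximum
s <- summary(iris$Sepal.Length)
print(s)

# The three quartiles on their own (Q1, Q2 = median, Q3)
q <- quantile(iris$Sepal.Length, probs = c(0.25, 0.50, 0.75))
print(q)
```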
The minimum we saw was 4.3, the maximum was 7.9, and we have also seen the mean and the median. Then there is the first quartile. The first quartile represents a value such that everything below it, up until 5.1, represents the first 25% of the data. Similarly, the median is nothing but your Q2, so from 5.1 to 5.8 represents 25% to 50% of the data. And the third quartile marks the 75% point of your data. As you can see, this is well-balanced data, since the median and the mean are almost the same. Now let's see a box plot, because all these values are also represented in a box plot. To create a box plot, I just have to use the function boxplot, and the parameters are going to be the same. Control+Enter, and here we have it. Let me enlarge it a little bit. Okay, so this is our box plot. This dark line in the middle represents the median value, which is 5.8. Similarly, this is Q1, quartile one, and this is Q3, quartile three. This is the minimum value, which is 4.3, as you can see here, and this is the maximum value, which is 7.9.
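The box plot step can be sketched like this. The call to boxplot.stats is my addition, used here to print the five numbers the plot is drawn from; the name b and the plot title are only for illustration.

```r
# Draw the box plot for Sepal.Length
boxplot(iris$Sepal.Length, main = "Sepal.Length")

# The numbers behind the plot: lower whisker, lower hinge (Q1),
# median, upper hinge (Q3), upper whisker
b <- boxplot.stats(iris$Sepal.Length)
print(b$stats)

# Points drawn beyond the whiskers, if any
print(b$out)
```

For this variable there are no points beyond the whiskers, so the whisker ends coincide with the minimum and maximum.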