Transcripts
1. Data Analysis and Statistical modelling in R PROMO: The success of any data science model depends on a comprehensive understanding of your data. If you don't know the nature of the data you're dealing with, you will end up building a useless data science model. Statistics has been around for ages, long before the time of artificial intelligence and machine learning. Statistical tools help scientists understand the existing data and learn about the true nature of problems. So in this new era of data science and machine learning, statistical modelling has become a most significant building block. If you apply the right statistical modelling to your data, you can predict the future better and also find better solutions to your problems.

Hi, my name is Joseph AECOM and I'm a data scientist with a master's degree. In this course, we will use the R programming language to perform data analysis and statistical modelling. I will start with the basic data distribution functions. Before performing any sort of analysis, we will first look into the foundational concepts. Once you have the proper knowledge, we will go to R and implement those functions with real-world examples.

So in the first section of this course, I will cover the basics of data distributions such as the normal, binomial and chi-squared distributions, and then the basic statistical concepts such as mean, median, mode, range, standard deviation, variance, sum of squares, skewness, kurtosis and outliers. In the second section, I will introduce you to different plots, with which you can interpret your data more clearly, and I will also show you which plot is suitable for which kind of problem. After learning to visualize our data using bar plots, pie charts, histograms, boxplots, scatterplots, mat plots and many other plots, we will apply statistical modelling to our data set.
We will be covering many statistical test functions, such as the proportion test, the default t-test, the independent samples t-test, the paired t-test, the F-test, ANOVA, Tukey HSD, the chi-square goodness of fit test, the chi-square test for independence, the correlation test and many others. Before actually seeing the real-world examples in R, I will walk you through the basic concepts, limitations and applicability of these tests.

This course is for anyone who wants to learn statistical modelling and perform statistical analysis on data using R. So if you don't have basic statistical modelling experience or have never worked with the mentioned tests, you are free to join. This course does not cover just the theory. It takes examples from real-world datasets, and by using statistical modelling we will make inferences. We will perform hypothesis testing, two-tailed and one-tailed, for different scenarios. I have many exercises prepared for you so that you can learn statistical modelling by doing it. So if you have been working with SPSS or Excel and you have heard from someone that R is the master of data analysis and statistical modelling, then you should take this course. Feel free to check the outline of this course, and I will see you inside.
2. Data analysis and statistical modelling in R Intro: Hi, welcome to this course, data analysis and statistical modelling in R. In the first section of this course, we will be covering different data distributions. First, we will learn about these concepts, and then we will use R to see the different functions of these data distributions. In the second section, we'll be working with plots and charts, in which I will show you different plots, working with different datasets, using plots for different groups and much more. This section will be very interesting. But the heart of this course is section three, in which we will learn about parametric tests: some core concepts such as what statistically significant means, p-values, hypothesis testing, proportion tests, t-tests with their types, F-tests, F-distribution concepts such as sum of squares and mean square error, one-way ANOVA, post hoc tests such as Tukey HSD, chi-squared tests with their types, and correlation and correlation types. So we have many things to cover in this course. As I said in the promo video of this course, you will have all the R coding files. So without wasting any time, let's start this amazing journey.
3. 1 Downloading and Installing R: Hi. To begin working with R, you need to download R and RStudio from the internet. Why do we need RStudio? RStudio provides a very interactive GUI to work with R, because R is a command-line programming language. So in order to work with R, let's first download RStudio from the internet. I'm on RStudio.com, and if I go to Products, click RStudio, then RStudio for desktop, I'm going to go with the open source free edition. So clicking here. Now, click on this free download link. It asks me which file I want to download. I have Windows, so I'm going to go with the RStudio .exe file. If you are a Mac user or you are using a Linux distribution, you can download those files and install the software accordingly. So let's go with the RStudio .exe file for Windows users. Clicking here. Click Next. I'm going to go with the default Program Files directory and install it right there. After completing the RStudio setup, when you are done, click Finish.

Now look for RStudio in the apps and run it. Even though I have RStudio installed, it's telling me that it requires an installation of R, because, as I told you at the beginning of this movie, RStudio just provides a graphical user interface to work with R. So we have the graphical user interface, but we don't have R installed on our computer system. For now, I'm going to click Cancel and go to the R website, cran.r-project.org/bin/windows/base, to download the Windows version of R. Just click on the latest version available; it's downloading right here. If you are a Mac or Linux user, you can always check these websites to find the appropriate R file. Let's click on the downloaded file. Click OK. Next, the same Program Files directory. Next, you can choose the user installation. I will go with the default settings. Click Finish. Now we have all the things we need to work with R. Now, let's look for RStudio.
So we have successfully installed RStudio and configured the R language in it. For just a basic demonstration, if I do some simple math like ten plus two and hit Control Enter, I get 12. So it means R is working completely fine. Next time, whenever you want to work with R, you can go to RStudio, or you can pin this program to the taskbar like me if you will be using it over and over in this course. So that's how you install and configure R.
4. 2 Navigating R studio: Hi, welcome back. In this movie, I will tell you about the overall functionality of the different RStudio sections. So first things first, let's learn how to create a file. Click on File, then New File, R Script. Then you can press Control S on your keyboard to save this file and rename it, or you can go to File and choose Save or Save As from there. I'm going to save it inside this folder, the complete R programming course, section one, and I'm going to name this file V1 Getting started. So this is your R script, in which you can write and save things.

This is the console panel. If I have 12 plus 3 and I run it inside my R file, you can see the output inside the console panel. I can also run different commands by writing them directly here. For example, I can do 2 plus 3, then Control Enter to run my code: five. This way, your 2 plus 3 won't be saved inside your file. So if you want to run some temporary things, like a single line of code, you can always use this console tab. Otherwise, you should always write your script inside your R file.

The next thing is this environment space. Let's say I have this variable a, which I define as equal to 10. If I press Control and then hit Enter, you can see that I got this a equal to 10 inside the environment. So this environment acts as the memory of RStudio. Now if I do something like six plus a and run it, you can see I get the output 16, because a is already in the memory, in the environment. I can define another variable like 10 -> b. This is another way to write the assignment: 10 is pointing towards b, which means 10 is saved inside this b. If I run it, you can see this b right here. And if you want to clear your memory, you can click on this broom icon, click Yes, and it's done. Now you don't have a and b inside your memory.
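The environment steps from this part of the lesson can be sketched in a few lines of R. This is a minimal sketch, not the instructor's actual file: the variable name `total` is introduced here for illustration, and `rm(a, b)` stands in for the broom icon, which clears every object (the equivalent of `rm(list = ls())`).

```r
a <- 10            # a now shows up in the Environment pane with value 10
total <- 6 + a     # a is read back from the environment
total              # 16

10 -> b            # right-hand assignment: 10 is stored in b
b                  # 10

rm(a, b)           # remove a and b; the broom icon removes all objects at once

# Asking for a now fails; capture the error message instead of stopping the script
msg <- tryCatch(a, error = function(e) conditionMessage(e))
msg                # in an English locale: "object 'a' not found"
```

Typing a line in the console instead of the script runs it the same way, but nothing is saved to the file.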
So if I just write a and press Control Enter, you can see I get an error that a is not found, because a is not inside the environment anymore. So I have to put it inside the memory again. Running the third line: see, a. If you want to see b, run the fifth line: see, b right here. So this is the Environment tab. You can always go to the History tab to see what you have written earlier. This is everything I had written when I was making this course. You can make a connection to some network. You can find different tutorials of R, or you can use this learnr package.

The reason why R is so popular compared to other software such as Excel or SPSS is its variety of functionality. R has hundreds of packages. A package acts like a box in which different developers write hundreds and thousands of lines of code and put that functionality inside. So the next time, when you install that package, you can directly use that functionality without writing thousands of lines of code. That's a great thing, isn't it? Another reason why I love R is that 99 times out of 100 I don't have to write custom code. I can look on the internet, search for my problem, and find a package. Someone has already done that type of complex coding and put it inside some R package. So if I'm a wise person, I will search the internet and try to find that package.

For example, I can go to this Packages tab, and here there are many packages. Even though I have already installed packages like MASS and datasets, I'm going to show you how to install a package: install.packages, and then you mention the package name. So I'm going to mention MASS. It's asking whether I want to restart R prior to installing. I will say no, as the package has already been installed; that's why I didn't go for it. Otherwise, you can run this command, install.packages, and pass the package name.
There is an alternative way to install your package. You can go to this Packages tab and click on this Install button. Inside here you can install from the CRAN repository or from a zip file. I'm going to download it from the CRAN repository. Here you can mention the package name, for example this package, data graph. Click Install. This package has been installed. Now, in order to work with this package, we need to load it. If I search for this data graph package, you can see it right here. I can click on this checkbox, or I can use the library function and pass the data graph package name to it. Run it. Now this package has been loaded and I can work with it. You can always click on a package name, and it will take you to the help section, where you can find all the functions and all the information about the package. So these are the functions available inside this data graph package. When you are done with a package, you can go to the Packages tab and uncheck the data graph box, or you can run the detach command, passing "package:" followed by the package name, with unload set to TRUE. If you want to remove a package, you can click on this remove package button right here, uninstalling this package from my RStudio. It's that easy.

Now, this is the Files tab, where you can find the different files inside your current directory. If you plot something, for example if I plot a comma b and run it, you can see that I'm now on the Plots tab and this is my plot. The x value is 10, as a was 10, and b is equal to 10 as well. If you are not sure about some terminology, you can go to this Help tab. For example, say I'm not sure what mean means. I can put a question mark and then the name of the function I'm not sure about. If I hit Enter, you can see that I am on the Help tab, and here is the information about this arithmetic mean function.
I can look for another function. For example, what does runif mean? If I run it, you can see that this is the uniform distribution function, and here I can even find examples. So that covers Files, Plots and Packages. You can always click on this broom to clear your plot. If you are in the Help tab, you can use these arrows to navigate forward or backward. You can click on Home, and here you can find different resources which can help you learn about R. If I go to the Session menu, Set Working Directory, To Source File Location, you can see this command has been run. Now I can access all the files which are inside this directory, because I selected the source file location. This movie was only meant for beginners who have never worked with R, so I was giving you an overall idea of RStudio. As the course progresses, we will be learning about many functions of R. So let's start this amazing journey.
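The package and help steps from this lesson can be sketched as console commands. This is a minimal sketch under a few assumptions: it uses MASS (the package named in the lesson), the `repos` URL is one CRAN mirror chosen for illustration, and the help calls are left as comments because they are meant to be run interactively.

```r
# Install MASS only if it is missing; the mirror URL is an assumption,
# any CRAN mirror works.
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS", repos = "https://cloud.r-project.org")
}

library(MASS)                              # same as ticking the box in the Packages tab
loaded <- "package:MASS" %in% search()     # TRUE: the package is attached

detach("package:MASS", unload = TRUE)      # same as unticking the box
still_loaded <- "package:MASS" %in% search()   # FALSE

# Help pages (run these in the console):
# ?mean     # arithmetic mean
# ?runif    # uniform distribution

getwd()     # the current working directory, which the Files tab lists
```

Installing and loading are separate steps: `install.packages()` is needed once per machine, while `library()` is needed in every new session.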
5. 1 Math functions: Welcome to this new chapter. R provides an extensive set of built-in math functions, including the exponential function, logarithmic functions with different bases, square root, absolute value, sine, cosine and similar; min and max to find the minimum and maximum value; which.min and which.max to find the index of the minimum and maximum element of a vector; pmin and pmax for the element-wise minima and maxima of several vectors; sum and prod; cumsum and cumprod, which are the cumulative sum and product of the elements of a vector; round, floor and ceiling to show your output in a very nice way; and the factorial function. There are dozens of other math functions available in R. Let's work with some of them in the next couple of minutes.

So here I'm in my directory, and you can see this new section, maths, data distribution and simulations, in which I have added this file, 1 math functions. Let's open it. Starting with this file, let's say I have this matrix, with 2, 4, 6, 8, 10 and 1 as the vector elements, three rows and two columns. So 3 multiplied by 2 is 6, and we have six elements in it, isn't it? Now what if you want to find the minimum of this? Passing y, the minimum is one. Let's see our matrix as well. So this is our matrix. You can also find the minimum of a part of it. Let's say I want to find the minimum in the second row and third row. That's one. What if I go for the first and second rows? Two, of course.

What if I go for pairwise minimum values? For that I can use the function pmin. So for the second and third rows, the minimum in the first position is four and in the second position it is one. Now the same thing, but this time with columns: the first column and the second column. What it did is go through these two columns and pick the smaller number from the first row, the smaller number from the second row, and the smaller number from the third row. Now let us also see another function.
Let's say I have this vector x and I want to find the cumulative sum of x. Run it. You can see 2, 5, 9, so every time it's adding the next element to the running total. I can also apply cumprod to x, which multiplies instead. In the same way, you can find the cummin of x and the cummax of x. For the first element the maximum is two, then three, then four, which is simply the original x. So in short, there are many math functions available. I have put a partial list of many useful R functions here as a comment so that you can use them later. If you are not sure how to use a function, you can always put a question mark and then the name of the function, for example exp for exponential. Then press Control Enter and the explanation shows up in the R documentation panel. It has a lot of useful information about that function, and at the end you can find some examples to look into. Using these built-in R mathematical functions can save a lot of your time, so you can focus more on the productivity of your tasks.
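The calls walked through in this lesson can be reproduced as below. This is a sketch rather than the instructor's file: the matrix values come from the lesson, the name `y` is taken from the narration, and `x <- c(2, 3, 4)` is inferred from the outputs read aloud (2, 5, 9 for the cumulative sum). R fills a matrix column by column by default, which is what makes the row and column results match.

```r
# 3 x 2 matrix, filled column by column:
#      [,1] [,2]
# [1,]    2    8
# [2,]    4   10
# [3,]    6    1
y <- matrix(c(2, 4, 6, 8, 10, 1), nrow = 3, ncol = 2)

min(y)                  # 1, the smallest element overall
min(y[2, ], y[3, ])     # 1, smallest value across rows 2 and 3 together
min(y[1, ], y[2, ])     # 2, smallest value across rows 1 and 2 together

pmin(y[2, ], y[3, ])    # 4 1, pairwise minima of row 2 and row 3
pmin(y[, 1], y[, 2])    # 2 4 1, pairwise minima of column 1 and column 2
which.min(y)            # 6, position of the minimum in column-major order

x <- c(2, 3, 4)
cumsum(x)               # 2 5 9
cumprod(x)              # 2 6 24
cummin(x)               # 2 2 2
cummax(x)               # 2 3 4
```

The difference to keep in mind is that `min()` collapses everything it is given into one number, while `pmin()` compares its arguments position by position and returns a vector.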
6. 2 Basic Statistic Concepts: Hi. Before we dive into statistics, I thought it would be a great idea to recall some of the basic statistical concepts. If during the video you find this very basic, you are welcome to go to the next video. Otherwise, hang in with us.

The most basic statistical concept is the mean, in which we add all the numbers and then divide by how many numbers there are. For example, if I have this series, 9, 3, 4, 6, 8, 9, 2, 1, and I want to find the mean, then first I have to add all these numbers and divide by the count, which is eight. We have eight numbers, and that's why we are dividing by eight. That gives our mean, 5.25.

The next thing is the median. We arrange our numbers in ascending order, and the middle number is our median. Same series: we have arranged it in ascending order, and now we have to pick the middle. Four and six are in between: if we skip 1, 2, 3 from the left side and 8, 9, 9 from the right side, we are left with 4 and 6. When there are two middle values like this, the convention is to take their mean, so I took the mean of 4 and 6, which is 5. So the median is five.

The next thing is the mode, the most common number. Same series: the number which appears more than once and has the largest number of occurrences is nine. If all the numbers occur equally often, there is no mode at all. The next thing is the range, the difference between the highest number and the lowest number. Same series: the difference between 9 and 1 is 8.

The next thing is standard deviation. Basically, it tells us how spread out our data is. For example, say I have a dataset and I draw it in the form of a chart. Let's say we have a normal dataset. We will talk about normal datasets in our upcoming movies, so don't worry about that for now. When we plot our normal data set, we get this bell curve shape.
The reason why we have a great rise in the middle is that most of our numbers are in the middle part. On the y-axis we have the number of occurrences. On the x-axis we have the measured values, and these are all our values. The right side has the positive values, the negative values are on the left side, and 0 is in between. If you plot standard normal data, you will have a mean of 0, so this 0 is our mean value, right in between. We divide our data into chunks to measure how far each chunk is from the mean. There is another term which we will discuss later, the z-score, but I'm not going to go into that for now. For now, we divide our data into chunks. The middle one is mu, which is the mean, and then plus one sigma, plus two sigma, plus three sigma. This sigma is our standard deviation. So the distance from 5 to 0, that is from 5 to our mean, is one sigma. Then from 11 to 0 is two sigma, and from 16 to 0 is three sigma, with the negative standard deviations on the left side. So these sigmas are our standard deviations. The farther we go from our mean, the more standard deviations away we are.

Now, if we square our standard deviation, we get our variance, and it is always positive. The squaring matters because deviations above and below the mean would otherwise cancel each other out. So variance is another measure of dispersion, the same as standard deviation but squared, and a large variance represents great dispersion. If we add up all the squared deviations from the mean, we call it the sum of squares.

The next thing is skewness, which is a measure of symmetry, or lack of symmetry. A data set is symmetrical if it looks the same to the left and right of the center point. So the middle diagram is symmetrical data, which is a normal distribution. The right diagram is positively skewed data. The left diagram is negatively skewed data.
Symmetrical data, that is, a data distribution which is normal, has its mean, median and mode at the same point, which means the mean, the median and the mode have the same value for our normal distribution, the middle diagram.

The next one is kurtosis, which is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. The first diagram is zero kurtosis with the Gaussian distribution, which we will learn about in our upcoming movies. The middle one is positive kurtosis and the right one is negative kurtosis. From the plot, the first diagram, zero kurtosis, is a normal distribution, so you can think of a Gaussian distribution as a normal distribution. The dotted blue line also represents the normal distribution in our middle and right plots. Data sets with high kurtosis tend to have heavy tails, or outliers.

Now what is an outlier? An outlier is a value which does not match the rest of the values. If you plot it, it lies very far from the normal values. For example, in this plot, the value 2.5 is an outlier, because all the rest of the values are no more than 0.8 on the y-axis and the differences between them are very small: the first value is 0.1, then 0.3, 0.4, 0.5, 0.6, 0.7, 0.8. But this value is 2.5. So this 2.5 does not lie with the rest of the values, and that's why it's an outlier.

If you want to find more information about basic statistical concepts, I have put up some very basic websites which give you easy-to-digest information about them. I have also included some free YouTube resources if you want to check out more. This course is not a statistics course; it's a statistical programming in R course. So we will be working mainly in R rather than spending much time on basic statistics.
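The worked numbers from this lesson can be checked in R. This is a small sketch using the series and the outlier values read out in the lesson. R has no built-in function for the statistical mode, so the `mode_value` lines are one illustrative way to get it, and `boxplot.stats()` is used here as one common outlier rule (points beyond 1.5 times the box height), which is an assumption and not necessarily the rule behind the lesson's plot.

```r
x <- c(9, 3, 4, 6, 8, 9, 2, 1)

mean(x)                       # 5.25
median(x)                     # 5, the mean of the two middle values 4 and 6

counts <- table(x)            # how often each value occurs
mode_value <- as.numeric(names(counts)[counts == max(counts)])
mode_value                    # 9 (if every value tied, this would return all of them)

diff(range(x))                # 8, highest minus lowest

sd(x)                         # sample standard deviation
var(x)                        # variance, the squared standard deviation
ss <- sum((x - mean(x))^2)    # sum of squares: squared deviations from the mean
ss                            # 71.5
ss / (length(x) - 1)          # equals var(x)

# Outlier example with the values read from the lesson's plot
v <- c(0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 2.5)
boxplot.stats(v)$out          # 2.5
```

Note that `sd()` and `var()` divide by n - 1 (the sample versions), which is why the sum of squares is divided by 7 here and not by 8.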
7. 3 Statistical Distributions: Hi. As we move forward, we will add more statistical power to this course, and the foundation of statistics is understanding data distributions. In R, we have the density or probability mass function, the cumulative distribution function, the quantile function, and random numbers. These are the data distribution functions in R. But before learning about these, we should familiarize ourselves with the basic statistical terminology.

So, starting with random: why random? Because statistics revolves around randomness. The concepts of randomness and probability are central to statistics, because most experiments and investigations are not perfectly reproducible. This level of irreproducibility may vary between experiments. For example, in physics you can create data which is accurate to many decimal places, whereas data on biological systems are typically much less reliable. That's why viewing data as something coming from a statistical distribution is important for understanding statistical methods. In this section, we outline the basic ideas of probability and the functions that R has for random sampling and for handling theoretical distributions.

I will divide data distributions into two parts, discrete random distributions and continuous random distributions. When looking at independent replications of a binary experiment, meaning 0 or 1, you would not usually be interested in whether each case is a success or a failure, but rather in the total number of successes or failures. Obviously, this number is random, since it depends on the individual random outcomes, and it is consequently called a random variable. In this case, it's a discrete-valued random variable that can take values from 0 to n, where n is the number of replications. Now, some data arise from measurements on an essentially continuous scale, for instance temperature, concentrations, et cetera. In practice, they will be recorded to a finite precision.
But it is useful to disregard this in the modelling. Such measurements will usually have a component of random variation which makes them less than perfectly reproducible. But these random fluctuations will tend to follow patterns. Typically they will cluster around a central value, with large deviations being rarer than smaller ones. In order to model continuous data, we need to define random variables that can take the value of any real number. Because there are infinitely many numbers infinitely close to each other, the probability of any particular value is 0. So there is no such thing as a point probability, as there is for discrete-valued random variables. Instead, we have the concept of a density, which is the probability of hitting a small region around x divided by the size of that region. So in short, a continuous distribution has point values, and we cannot attach a probability to a single point. That's why we get help from the density, which takes a small region around our point value and divides its probability by the size of that region.

Anyway, there are a number of standard distributions that come up in statistical theory and are available in R. It makes little sense to describe them in detail here, except for a couple of examples. Take the uniform distribution, which has a constant density over a specified interval, by default 0 to 1. The first data distribution which comes to mind is the normal distribution, of course, and then the binomial distribution and the chi-square distribution. We will talk about these in the upcoming minutes.

So first of all, the normal distribution. It's also called the Gaussian distribution, with a density depending on its mean and its standard deviation, sigma. This is the typical formula of the normal distribution. I'm not going to go into the details of this formula. Rather than that, I will get help from this simple plot. This is the plot of a normal distribution, with the mean mu in the middle and all the values clustered towards the mean.
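The idea of a density as "probability of a small region around x divided by the size of that region" can be checked numerically. This is a rough sketch under stated assumptions: it uses the standard normal distribution (mean 0, standard deviation 1), the point `x0` and half-width `h` are arbitrary illustrative choices, and `pnorm()` is R's cumulative distribution function for the normal.

```r
x0 <- 1          # the point we are interested in (illustrative choice)
h  <- 0.001      # half-width of the small region around x0

# Probability of landing in the region [x0 - h, x0 + h]
p_region <- pnorm(x0 + h) - pnorm(x0 - h)

# Divide by the size of the region to get an approximate density
approx_density <- p_region / (2 * h)
approx_density   # very close to dnorm(x0)
dnorm(x0)        # about 0.2419707

# The uniform distribution has a constant density over its interval (default 0 to 1)
dunif(c(0.1, 0.5, 0.9))   # 1 1 1

# The density itself is not a probability, but the total area under it is 1
integrate(dnorm, -Inf, Inf)$value
```

Shrinking `h` makes the region probability go to 0, which is the "no point probability" statement, while the ratio stays near the density.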
So the more you increase the standard deviation of your normal distribution, the wider the bell curve spreads out. You will be hearing about the normal distribution many times in statistics, so let's look at it in more detail. Let's say we have this normal distribution, where the number of occurrences is on the y-axis and the measured value is on the x-axis. These are our values, where 0 is in between, positive values are on the right side and negative values are on the left side. If we divide our bell curve into chunks, it looks something like this. Every division, every chunk, is a specific distance from the mean. We can call it sigma, one standard deviation from the mean, where mu is our mean. Now this whole bell curve represents 100 percent of our data.

Let's say I have a dataset with a million values, and I draw that data set, and it is normally distributed, which means it's a complete representation of our population. For example, let's say the population of Australia is 100 million, and I run an experiment where I go to 1 million people at random and ask them a couple of questions. After analyzing people's responses using statistics, I submit my results to my supervisor. I will say that the sample data was 1 million only, although there are 100 million Australians to whom I'm going to generalize this study; still, I went to only 1 million people. So these 1 million people are the representatives of 100 million Australians. If you select these 1 million people without any bias, then you can say that they are a representation of our overall population of 100 million. So if your sample of 1 million people is a true representation of our population of 100 million people, then we expect that sample data, that 1 million data points, to be normally distributed. And when we plot such data, we get a bell curve, something like the one on the screen.
Now, we will divide this bell curve into many parts. For example, this area is 99.7 percent of our total sample data. This area is 95.5 percent of the total data. This area has 68.3 percent of the values of the whole dataset. We can divide it further, and every chunk has the same share in the other direction when we consider this division. So this is normally distributed data, where all the values are random and they are close to the mean, and for the standard normal the mean is 0, right in the middle. One more thing: in a normal distribution, the mean, median and mode have the same value. Remember that.

This brings us to a statistical theorem, the central limit theorem, which states that the distribution of sample means approximates a normal distribution as the sample size gets larger. Sample sizes equal to or greater than 30 are usually considered sufficient for the CLT, the central limit theorem, to hold. A key aspect of the CLT is that the average of the sample means equals the population mean, while the standard deviation of those sample means is the population standard deviation divided by the square root of the sample size, so it shrinks as the sample grows. In other words, the average of the means from samples of our 1 million people will match the mean of the 100 million people. So a large sample can predict the characteristics of a population accurately. In other words, experimenting on 1 million people can predict the characteristics of 100 million people accurately. Of course, in the real world, if you want to generalize your experiment to 100 million people, instead of having just 1 million you would want a data set of at least 20 or 30 million. I just took 1 million so that you can understand it at a basic level.

Now, after the normal distribution, we have the binomial distribution, which can be thought of as simply the probability of a success or failure outcome in an experiment or survey that is repeated multiple times. The binomial is a type of distribution that has two possible outcomes. Bi means two.
For example, a coin toss has only two possible outcomes, heads or tails, and taking a test could have two possible outcomes, pass or fail. There are criteria for the binomial distribution to hold. First, the number of observations or trials must be fixed. In other words, you can only figure out the probability of something happening if you do it a certain number of times. This is common sense, right? If you toss a coin once, your probability of getting a tail is 50 percent. If you toss a coin 20 times, your probability of getting at least one tail is very, very close to 100 percent. Second, each observation or trial is independent. In other words, none of your trials has an effect on the probability of the next trial. And third, the probability of success (tails, heads, fail or pass) is exactly the same from one trial to another.

The last major distribution is the chi-square distribution. When we want to compare whether two samples of categorical data come from the same population, or whether a sample follows a given population, we can use the chi-square distribution.

So in short, we can divide our statistical distributions into three major types: the normal distribution, the binomial distribution, and the chi-square distribution. These functions, density or PMF, CDF, quantiles and random numbers, are the further division in R. Now, before diving into functions such as dnorm, dbinom, dchisq, pnorm, pbinom, et cetera, let's first learn about these concepts: what is density, what is the CDF, what are quantiles, and what are random numbers? That's in our next movie.
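Three of the claims in this lesson can be checked with a short simulation: the shares of a normal distribution within one, two and three standard deviations, the central limit theorem, and the coin-toss probabilities. This is a sketch under stated assumptions: the skewed exponential population, the seed, the population size and the number of repeated samples are all illustrative choices, not values from the lesson (only the sample size of 30 is).

```r
# Share of a normal distribution within 1, 2 and 3 standard deviations of the mean
within <- sapply(1:3, function(k) pnorm(k) - pnorm(-k))
round(100 * within, 1)     # 68.3 95.4 99.7 (the lesson rounds the middle one to 95.5)

# Central limit theorem: means of samples from a clearly non-normal population
set.seed(42)                                   # arbitrary seed, for repeatability
population <- rexp(100000, rate = 1)           # skewed population (assumption)
n <- 30                                        # sample size mentioned in the lesson
sample_means <- replicate(5000, mean(sample(population, n)))

mean(sample_means)         # close to mean(population)
mean(population)
sd(sample_means)           # close to sd(population) / sqrt(n), not to sd(population)
sd(population) / sqrt(n)
# hist(sample_means)       # roughly bell-shaped even though the population is skewed

# Coin toss as a binomial experiment
p_one_toss  <- dbinom(1, size = 1, prob = 0.5)        # 0.5: a tail in one toss
p_any_in_20 <- 1 - dbinom(0, size = 20, prob = 0.5)   # at least one tail in 20 tosses
p_any_in_20                # about 0.999999
```

The simulation shows why the spread of the sample means has to be scaled by the square root of the sample size: individual values vary much more than averages of 30 values do.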
8. 4 Statistical Distributions Functions: Welcome back. So these were our distribution functions with respect to density, CDF, quantiles and random numbers in R, where d is for the density or probability mass function, p is for the cumulative probability distribution function, which is the CDF, q is for quantiles, and r is for random number generation. So let's say you want to create a quantile binomial series; then you will go with the function qbinom. If you want to create a random normal series, you will go with rnorm in R. If you want quantiles of a chi-square distribution, you will go with the function qchisq(), which is the chi-square quantile function.

Now let's learn about density, CDF, quantiles and random numbers at a very basic level. We can divide density into two parts, density for a continuous distribution and density for a discrete distribution. The density for a continuous distribution is a measure of the relative probability of getting a value close to x. Because a continuous variable can take any point value, the probability of getting a value in a particular interval is the area under the corresponding part of the curve. For a discrete distribution, the term density is used for the point probability, the probability of getting exactly the value x. Technically, this is correct: it is a density with respect to counting measure.

Next, the cumulative distribution function describes the probability of hitting x or less in a given distribution. And the quantile function is the inverse of the cumulative distribution function, so the p-quantile is the value with the property that there is probability p of getting a value less than or equal to it. The median is always the 50 percent quantile. So for 100 percent of the data, the 25 percent point is the first quartile, 50 percent is the second quartile, and 75 percent is the third. Random numbers are computer-generated numbers.
And in professional statistics, they are used to create simulated datasets in order to study the accuracy of mathematical approximations and the effect of assumptions being violated. Now back to our functions: density, CDF, quantiles and random numbers for each major distribution, normal, binomial and chi-square. In the next movie, we will work with some of these major functions and I will try to bring in real-life applications of these functions. So see you there.
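As a minimal sketch of the d, p, q, r naming convention described in this movie, the following R lines call one function of each kind. The specific arguments (10 trials, 1 degree of freedom, the seed) are assumptions chosen only for illustration.

```r
set.seed(1)                       # assumed seed, only to make the random draws repeatable

d_val  <- dnorm(0)                # d: density of the standard normal at 0
p_val  <- pnorm(0)                # p: cumulative probability of a value <= 0
q_val  <- qnorm(0.5)              # q: the 50 percent quantile (the median)
r_vals <- rnorm(5)                # r: five random normal numbers

# For a discrete distribution, d gives the point probability:
# exactly 3 successes in 10 trials with success probability 0.5.
b_val <- dbinom(3, size = 10, prob = 0.5)

# Chi-square quantile: the value with 95 percent of the distribution below it.
chi_q <- qchisq(0.95, df = 1)

# The quantile function is the inverse of the cumulative distribution function.
round_trip <- qnorm(pnorm(1.2))

print(c(d_val, p_val, q_val, b_val, chi_q, round_trip))
```

The same four prefixes work for the other distributions, for example pbinom, qbinom, rbinom, dchisq, pchisq and rchisq.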
9. 5 Data Distribution and Simulation in R: Hi, let's work with these statistical distribution functions in this movie. So I have this file, data distribution and simulation.R, which you can find in the same folder of this section. Let's say I have this vector x, which is equal to this sequence, starting from minus eight up to positive eight, by 0.1. Now let's see our x. So basically we have all values starting from minus eight up to positive eight, and every single value is 0.1 away from the previous value. Now, if I just plot all of this data, you can see that it's basically a straight line. But what I'm interested in is plotting this in the normal density format. So I will apply this function dnorm on this x, and let's run it for now. So now you can see that this is our diagram with the normal distribution, and I can also define a parameter for the type, so making the type to be "l", which is line type. Now you can see this type "l". Now what if I don't have this series at all? If I remove all of that from my workspace and now try to access this x, of course I will have an error, right? Now, if I just run this plot, you can still see the error. So let's run it. I can also give it a standard deviation and mean. So for this, I can say dnorm of x with the mean of 0 and standard deviation equal to one. Now if I run it, this is our data. Now what if I remove this x from the workspace? If I try to plot it, you can see that we still get the error because there is no x defined. Instead of plotting, you can also show a curve with the normal density function. So curve with dnorm of x, then passing my ranges. So I want it to start from minus nine and go to positive nine. All good. So this is our normal density graph. Now, all of this was for a continuous variable, when you want to plot your continuous density. Now for discrete, let's say I have this sequence of x from 0 to 60.
Now plotting this discrete vector by using the binomial density function, dbinom, which will take this x. And also I have to define the size, which will be, let's say, 50, and the probability for every trial. And then for this plot, I can define the type equal to "l". So this is our plot where we have this size of 50 and this probability. What if I clear my workspace and now run this plot? You can see that it's not working, because we have to initialize x first and then we will be able to draw this diagram. It seems like it is normally distributed, but in reality it's not, although you can notice that bell curve over here. For the same plot, I can also change the style to "h" for a histogram-like diagram. Normally it looks good for binomial type data. You can define different types of curves by using these functions. For example, I want to define this curve, which has values from, let's say, minus three to positive three. So this one would be negative. And let's make the type to be "h". And of course we have to put it in double quotes. So this is the normal density curve from minus 3 to 3. Now, for the same thing, I can also add a mean to it. So adding the mean. Now if I run it, you can see that by assigning a mean of two, it has shifted all of the data towards the right side. So within this range it looks like a negatively skewed curve. For the same curve, instead of the mean, if I give it a standard deviation of two and, let's say, make it from negative 2.5 to positive 2.5, running it. Now you can see that it is bigger, occupying all of this. So the standard deviation is two for this graph and it's a normal density curve. So specifically, you can adjust this curve to make it look normal, because in normal curves most of our data lies towards the center and the edges are not as wide as this one. So we can adjust this by increasing or decreasing the ranges of this curve. I'm going to add a comment over here. So let's copy this line 26 and paste it right here. And instead of 2.5, let's make it minus 2.8 to 2.8. Running it.
You can see that it has shrunk a bit as compared to the previous shape. Let's copy it again. And let's make it from minus three to positive three. Now copying it again. And this time let's make it minus 3.5 to positive 3.5. It has shrunk again. Now let's make it minus 4.5 up to positive 4.5. It shrinks again. From minus five to five. Again. Let's make it minus six to six. So now I think it's completely looking normal to me. It goes all the way, and in the central part, around the mean, you can see that it has the maximum values. Now you may be wondering why we are doing all of this. Suppose you have certain data with a specific standard deviation and you want that data to be normal. When you draw it for the first time, you get a shape like this. And then you have to figure out how you can make the most use of this data by converting it to normal. For that, you can adjust your ranges and you can identify at which sample size you might get a normal distribution, so that your sample data is a complete reflection of your overall population. In such scenarios, you can use these curves and these normal distribution functions to get an estimate of the best range to have normally distributed sample data. Now, let's see our cumulative function, which is pnorm. So pnorm of 0 would be 0.5, and pnorm of 1.960, which is a z-score that we'll be learning about in our next section, will be about 97.5 percent. And the same pnorm cumulative distribution with the minus sign will be the leftover from the positive one, about 2.5 percent. So basically pnorm accumulates over the whole diagram, the whole normally distributed data. So if you have a value of 1.960, you will get about 97.5 percent of your data below it. This 1.960 is the z-score, and a z-score is the number of standard deviations from the mean that a data point is. This is what a z-score tells you. So a z-score of 1.960 tells you that your data point is 1.96 standard deviations above the mean, with about 97.5 percent of the data below it.
And a z-score of minus 1.960 tells you that your data point is 1.96 standard deviations below the mean, with only about 2.5 percent of the data below it. So let's wait for the right time to come. Then we will learn more about this pnorm and z-scores. So I will add a comment for quantiles. And we will work with quantiles and the cumulative functions in our statistical test section. And I will complete this file over there so that you can truly understand what I'm doing over here. So lastly, for random, I want to generate, let's say, 10 random numbers which are normally distributed. So these are 10 random numbers. It can also generate ten random numbers with a mean value of, let's say, seven and standard deviation five. These are our numbers. I can generate random binomial numbers as well, though not with the mean; for binomial, you have to give the size. So I want to have a size of 20 with a probability of, let's say, 0.5. So these are randomly generated binomial numbers with the probability of 0.5. Let's end this lecture for now. In the statistical tests section, when we learn about z-tests and cover some of the cumulative distribution of our normal curve, I think it would be a great idea to discuss these concepts. Because even if I explain it right now, if you don't have a basic knowledge of z-scores, then it would not be fair to all those people. So for cumulative and quantiles, let's wait for the right time and we will come back again to this file very soon. So hang in there.
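The steps of this movie can be sketched in R roughly as follows. This is not the instructor's actual file: the variable names, the binomial probability of 0.5 in the discrete plot, and the seed are assumptions made only so that the sketch runs on its own.

```r
# Continuous case: normal density over a grid from -8 to 8 in steps of 0.1.
x <- seq(-8, 8, by = 0.1)
plot(x, dnorm(x, mean = 0, sd = 1), type = "l")

# The same density drawn with curve(), which needs no x in the workspace.
curve(dnorm(x), from = -9, to = 9)

# Discrete case: binomial point probabilities, drawn as vertical lines.
k <- 0:60
plot(k, dbinom(k, size = 50, prob = 0.5), type = "h")  # prob = 0.5 is assumed

# Cumulative probabilities at z-scores of -1.96, 0 and 1.96.
lower  <- pnorm(-1.96)   # about 0.025
centre <- pnorm(0)       # exactly 0.5
upper  <- pnorm(1.96)    # about 0.975

# Random numbers: normal with mean 7 and sd 5, binomial with size 20 and p 0.5.
set.seed(42)             # assumed seed for repeatability
normal_draws <- rnorm(10, mean = 7, sd = 5)
binom_draws  <- rbinom(10, size = 20, prob = 0.5)
```

Because the normal curve is symmetric, lower and upper add up to 1, which is what "the leftover from the positive one" refers to.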
10. 1 Bar plot: Hi, welcome to this new chapter of plots and charts, in which I will show you how you can understand your data graphically and how to represent it in a more interactive way. So right in our directory, I will create a new folder with the name of section 7, plots and charts. And let's create a new file in the same folder of section 7. And I'm going to name this file 1 bar plots. So bar charts are best for representing your categorical variables. For example, if I create this barplot, taking a sample from one to ten, and run it, you can see this bar plot. If I just run this simple statement, you can see that we have five numbers collected from the range one to ten. Now, a simple thing is, you can give it some sort of color as well. I can give it colors from the rainbow family. Let's give it five colors. Running it, you can see these colors. And every time you run this sample, it will create new numbers for you. See. So I used barplot to show you some numbers. The next thing is, I'm going to show you how you can work with a real dataset. So I'm pretty sure you have already installed this package MASS. So this time I'm just attaching it. And if you have not, you can always run this command install.packages and mention the package name, which will be MASS. And if you want to learn more about this package, what you can do is call this function library with help equal to MASS and run it. It will give you the documentation of this package and what you can do with it. So there are many datasets available in this package. You can see this ships dataset. Notice this shoes dataset and this wtloss, weight loss, dataset. You can also go to this Packages tab, look for your package MASS, click on this link and it will open up the description of all the available datasets or functions inside any package. So what I'm interested in is this wtloss, if you just click on it.
You can find the information about this dataset. So it has Days and Weight columns. And the data frame gives the weight, in kilograms, of an obese patient at 52 time points over an eight month period of a weight rehabilitation programme. And you can always find the reference if you want more information about this dataset, plus a couple of functions which you can try. So let's work with this weight loss dataset. Closing this tab. Right here, let's bring this data into our workspace. If I run it, and now if I show you this dataset, it has Days and Weight columns. So starting with the same simple statement, plot, I'm going to plot this dataset wtloss. You can see that the weight is on the y-axis and the days are on the x-axis. And the weight is very high at the beginning of this rehabilitation initiative, and with the passage of time the weight is decreasing, almost down to 110. So you get the basic idea from this. Let's use barplot to view this dataset in a more attractive way. So barplot, in which I will set the height equal to the Weight column of wtloss, and the names argument equal to wtloss Days. If I run it, you can see that it has given us the weight for every single recorded day. If I zoom in, the first bars run from day 0 to day 7, so this is the first week, and these are the weights recorded in the first week. So what you can do is add more parameters, for example, you can add xlab, Days. Running it. And let's also give it a color. So I'm going to use heat.colors. As you increase the number of heat colors, let's make it 50, you can see that it's going from more red to white. If I make it 40, or 60, it looks better. The next dataset which I'm going to show you is from the package datasets, and its name is ChickWeight. If I go to the Packages tab, this package datasets.
You can see this dataset, ChickWeight, in which we have a couple of columns: weight, Time, Chick and Diet. And the dataset has a certain number of rows and columns from an experiment on the effect of diet on the early growth of chicks. Now this time I'm going to do it in a different way. But first, let's see the first couple of rows. So we have these four columns. So I'm going to take an aggregate between the weight and the Time column. So right here, for this variable, I will use this function aggregate, in which I'm going to use the column weight and take an aggregate with respect to the Time column, with the capital T. And I'm using the formula notation, not the dollar one. That's why I need to provide the name of my dataset, which is ChickWeight. And the function which I'll be applying is mean. If I run it and show you my aggregate, you can see that at the beginning of this experiment the mean weight was 41. And when the time progressed to 20, the mean weight increased up to 209. So this time is the number of days. And you can clearly see that with the passage of time, with the diet, chick weight has been increasing. It started with a mean weight of 41, and on the 21st day the weight had increased to 218.6. So after taking an aggregate, let's see this aggregate analysis in the form of a barplot. So this time the height argument would be equal to the weight of this aggregate, and our names argument will be equal to the Time of the aggregate. The xlab, ylab and main title would be these. If I run it, you can see that the average weight is on the height axis and the names argument has the time. Let's also add a couple more parameters here. So I'm going to start with las, which gives the orientation of the axis labels. So I will give it a las of one. The moment I run it, you can see that now the labels have a horizontal orientation. The next argument I'm going to give it is horiz, which can make this bar plot be drawn with horizontal bars. See. You can also give it some border. So border 2, or something like that.
And for colors, I can give it topo.colors of ten, run it. Let's make it 12. Sweet. If you want to remove this border, you can set the argument to NA and there won't be any border here. And as always, you can put the hash symbol to comment out a certain parameter. Okay? Now the last demonstration is the shuttle dataset, which is available in our package MASS. This shuttle dataset, if I click on it, you can see that it has seven columns and it's inside our package MASS. If I just show you a couple of rows of this dataset, you can see that we have seven columns. What I want to demonstrate over here is that R will only create a barplot for numerical data. In the first bar plot, where we had weight loss and days, it created these categorical bars, from 60 to 74 for example, to show us the barplot, even though these columns have numerical values. If I go to the next plot of this ChickWeight, these columns weight and Time also have numerical values. If I run this head, this weight and Time have numerical values. So in this dataset of shuttle, you can see that it has this column use, which is in categorical form, in a character representation, not a number. And if I try to barplot this type of column, for example, if I barplot shuttle use, I will of course get an error. So we need to prepare the data, because R cannot create a bar chart directly from categorical variables. Instead, we must first create a table that has the frequencies for each level of the variable. So I will create a table. So this variable will be equal to that table, mentioning my dataset and this column use. If I run it and show you this use, now you can see that we have a total of 145 auto values and 111 noauto values. Now, if I barplot this use, let us also give it some main title and color. If I run it.
Now you can see that we can easily present this finding in our presentation and say that shuttle use auto has more values as compared to noauto use. And it clearly matches the numerical counts. So whenever you are working with bar plots, make sure that you convert the categorical variables into the form of frequencies for each level of the variable.
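The three bar plot demonstrations from this movie can be sketched as below. The titles, axis labels and the exact number of colors are assumptions; the datasets (wtloss and shuttle from MASS, ChickWeight from datasets) and the functions are the ones named in the lecture.

```r
library(MASS)  # provides wtloss and shuttle; ChickWeight comes from datasets

# 1. One bar per recorded day of the weight rehabilitation programme.
barplot(height = wtloss$Weight, names.arg = wtloss$Days,
        xlab = "Days", ylab = "Weight (kg)", col = heat.colors(60))

# 2. Mean chick weight per day, using the formula notation of aggregate().
mean_by_time <- aggregate(weight ~ Time, data = ChickWeight, FUN = mean)
barplot(height = mean_by_time$weight, names.arg = mean_by_time$Time,
        xlab = "Time (days)", ylab = "Mean weight", main = "Mean chick weight",
        las = 1,                 # horizontal axis labels
        col = topo.colors(12),
        border = NA)             # no border around the bars

# 3. A categorical column must be turned into a frequency table first.
use_counts <- table(shuttle$use)
print(use_counts)
barplot(use_counts, main = "Shuttle use", col = c("steelblue", "tomato"))
```

Adding horiz = TRUE to any of these calls draws the bars horizontally instead.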
11. 2 Bar plots for groups: Hi, in the last movie, we saw a bar chart for one variable. Now in this movie, I will show you bar charts for groups. Let's close this previous file. So we need the library MASS, and the dataset we will be working with is chickwts. It's already loaded. Now, I'm going to create this variable, chick, which will point to this dataset chickwts. If I show you this chick now, you can see that it has weight, and every chick was given a certain diet. So these chicks, which have these weights, ate this horsebean diet. And then some of them had this linseed, and some of them had this soybean diet. So I will create mean result, and we will take the aggregate between the weight and the chick feed. And the function will be mean. If I run it and show you my mean result, you can see that for every single diet we have this mean weight. So if I plot this mean result, you can see that this plot doesn't make any sense, does it? So now we have this mean result in the form of groups. So the casein group has this mean weight, the horsebean group has this mean weight. And earlier, in the previous movie, we did not have a group. Now how can we plot this? So first things first, we need to remove this column because we can only plot numbers. So what I will do is remove the first column from this mean result and transpose the second column, because these weights are associated with their feed group. If I try to plot it as it is with the chick weight, then R will think that these weights are not for groups, that these weights are for this weight column only. So I will create this variable organized mean. And I will take the transpose of my mean result and remove the first column from it. If I run it and show you my organized mean, you can see that now every single mean is at an individual level. So first row, first column will have this mean. First row, second column will have this mean, and so on.
Now, every single mean is being treated as a group by transposing it. So now we are good to apply the column name of each group to these means. So using this colnames on this organized mean, I will set it equal to the first column of mean result. If I just show you the first column, selecting it and running it, you can see that it has different levels: casein, horsebean, linseed, meatmeal, soybean, sunflower. So now I can stick these names on the organized mean. So if I run it and now show you my organized mean, you can see that casein has this mean, horsebean has this mean, linseed has this mean. So we did a bit of a trick, removing the column and then transposing it to make the groups organized according to their diets. The next thing is, I can plot this. So barplot of organized mean, and main will be feeds result. xlab will be feeds, and then a ylab. If I run it, you can see that this is our plot. If I make it bigger, you can clearly see that the most effective feed is sunflower, and we can also confirm it from here, 328. And the least effective feed is horsebean, and we can also confirm it from here. Let's also give it some color. So I'm going to give it a color in the form of hexadecimal notation. All good. So don't worry about these warning messages. So we started with a dataset which has the data in the form of groups. So the first column was numerical and every single weight was associated with its diet. So what we did in this file: we first took the mean according to the respective group. Then we took the transpose and removed the first column. After that, we named it and then we plotted it. And the only reason we applied this transpose is because our data was in an unorganized form. It was treating all of our values as one column, although they were our groups. That's why we converted it to an individual level by transposing it. And then we assigned the names by using this function colnames.
So when you have data in the form of groups, you can always use this technique to see it in a bar plot.
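The grouping technique from this movie can be sketched as follows. The variable names mean_result and organized_mean mirror the spoken names, and the title, axis labels and hexadecimal color are assumptions. The call to as.character is an addition for safety, since the feed column is a factor.

```r
# Mean weight for every feed group in the chickwts dataset.
mean_result <- aggregate(weight ~ feed, data = chickwts, FUN = mean)

# Drop the first (feed) column and transpose, so each mean becomes its own column.
organized_mean <- t(mean_result[-1])

# Attach the feed names as column names.
colnames(organized_mean) <- as.character(mean_result[, 1])
print(organized_mean)

# One bar per feed group, colored with a hexadecimal color code.
barplot(organized_mean, main = "Feeds result",
        xlab = "Feeds", ylab = "Mean weight", col = "#4682B4")
```

Reading the printed matrix confirms what the plot shows: sunflower has the highest mean weight and horsebean the lowest.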
12. 3 Pie Charts and Graphical Parameters: Hi. Other than using bar charts, you can use pie charts to represent your categorical variables. So in this movie, we will learn about the flexibility and validity of pie charts. It has a very simple syntax. All you have to do is use this function pie, in which let's add some random five numbers. But before that, first let me introduce this function runif. So this function will provide information about the uniform distribution on the interval from minimum to maximum. So dunif gives the density, punif gives the distribution function, qunif gives the quantile function, and runif, which we are using right now, generates random deviates. So if I just create five uniformly distributed numbers, you can see we have these five numbers. And if you go down, you can find more information and examples for this function. Anyway, I'm going to plot these five numbers using a pie chart, and let's give it some color. I will give it the color of rainbow, five. If I run it, you can see that this is a pie chart. You can also give it some sort of radius. So two, running it. Well, it's quite big. If I make it one. If I make it smaller still, now it doesn't seem like a pie chart anymore. So this is the simplest form of pie chart. Let's work with a real-life dataset. So I'll be using the package datasets. And inside this package, I'm going to use the Orange data. If I can show you this dataset real quick. So going to datasets, clicking on it. And if I go to the O section, you can see this Orange data, which has the growth of orange trees. It has three columns: Tree, age and circumference. So let's see a couple of rows. It's always a good practice, before performing some kind of analysis, to first get the overall idea of your data. I have already worked offline before making these videos. That's why it won't take me so long to give you the findings about some dataset.
Otherwise, it can take minutes, maybe hours, to understand the nature of data. Okay? So this dataset has three columns: Tree, which is 1 and so on, age, and circumference. If I go with the levels of Orange Tree, you can see that we have five types of trees here. But what we are interested in is making some pie charts out of this dataset. So let's make a pie chart of the tree age. So first things first, we need to convert these ages into the form of a table. So right inside the pie chart, I will use this table and apply it on Orange age. And let's give it some color, running it. So out of many ages, we have made a table of our ages and we can see the proportion of every single tree age. You can also add labels. For example, what I can do here, if I just copy this, right after that table I'm going to add labels, in the form of a vector, changing to a different color. If I run it, now you can see this, and we can add two more here. So most of the names are just dummy names. Now I'm going to show you how you can resize your image and, just using one image space, how you can add margins to your shapes, whether you are using a bar chart, pie chart or anything similar. So I'm going to introduce one function, which is par. It can be used to give margins, and it has many graphical parameters available, like mar, oma, mfcol and mfrow. Mostly we use these graphical parameters, but you are free to use any of the graphical parameters. So I will start with a simple graphical parameter, so par and then oma, which is used to set outside margins. For example, I can make it equal to one vector and I can set the outside margin to one comma one comma one comma one. So bottom, left, top and right. If I run it, and let's copy this pie chart and first run it again. And now I have pasted it here. So we just used the default margins. If I make the bottom margin five and run it, and then go for my pie chart.
Now you can see that we have given it a margin from the bottom, which is five. And let's make it a bit smaller, so making it the default one. Okay, the next graphical parameter is mar. So first adding par, and then mar will be equal to one vector with some values, so one comma one comma one comma one, and it is used to set plot margins. So if I copy this pie chart again and paste it right after it, let's add a couple of spaces. Now, first, if I run this par mar and then the pie chart, you can see this pie chart, which has the margin from bottom, left, top and right. If I make the bottom two and the right two, run it, then run the pie chart, now you can see that it has a bigger margin from the right side and also from the bottom. The next graphical parameter I'm going to show you is mfrow. So for mfrow, if you go to the par documentation, you will find information about mfrow: a vector of the form c(nr, nc), number of rows and number of columns. Subsequent figures will be drawn in an nr-by-nc array on the device by columns (mfcol) or rows (mfrow), respectively. So you can set this parameter equal to one vector, and that vector could be one row and one column. And I'm going to show you one more parameter, which is cex.main, for enlarging the title. So cex.main equal to two, and run it. Now if I copy this and run this pie chart, consider all of this space as one row but two columns. If I run it, then run the pie chart, you can see that now we have the space to draw one more figure on this side. And cex.main, let's make it five. If I run it, if I zoom this. So first let's give it some main title, running it. Okay, this one is quite big because of the cex.main. Make it one and then run it. Now you can see these tree slices. If I make it three and run it, it's getting bigger. So cex.main is for enlarging the title and mfrow divides a graphical device into rows and columns.
So let's end this lecture right here, having learned the most used graphical parameters. And we will continue with this pie chart in the next movie.
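A minimal sketch of the pie chart and par steps from this movie follows. The margin values, titles, dummy labels and the seed are assumptions; the functions and parameters (pie, runif, table, par with oma, mar, mfrow and cex.main) are the ones discussed above. Saving and restoring the old settings with old_par is an addition so that later plots are not affected.

```r
# Simplest form: five uniform random numbers as slices.
set.seed(7)                                   # assumed seed for repeatability
pie(runif(5), col = rainbow(5), radius = 1)

# Frequencies of each tree age in the Orange dataset.
age_counts <- table(Orange$age)

# Graphical parameters; margins are given as bottom, left, top, right.
old_par <- par(oma = c(1, 1, 1, 1),   # outside margins
               mar = c(1, 1, 2, 1),   # plot margins
               mfrow = c(1, 2),       # one row, two columns of figures
               cex.main = 1)          # size of the main title

pie(age_counts, col = rainbow(length(age_counts)), main = "Tree ages")
pie(age_counts, labels = paste("age", names(age_counts)),   # dummy labels
    col = heat.colors(length(age_counts)), main = "Tree ages, labelled")

par(old_par)                          # restore the previous settings
```

Because mfrow is c(1, 2), the two pie charts are drawn side by side on the same device.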
13. 4 Finishing Pie charts: Continuing pie charts. So right here, let's first clear the graphical parameters set with par. So I'm going to add one x and also one more vector with the name of labels. Now plotting this data, so pie of x and labels, and the main title would be pollution by cities. And col is equal to heat.colors. So if I run it, first I need to run this x, then labels, then pie. So you can see that this is our plot, and individually I have provided some labels. Now, I want to show you some more parameters. For example, you can add the angle. So init.angle equal to 90. If I run it, I have to add an i here, so it's init.angle. And I can also pass this argument clockwise equal to TRUE or FALSE. So it will start at 12 o'clock instead of three o'clock if you just add this argument init.angle. So just like this, you can add multiple pie charts in the same figure. So I'm going to copy all of this and paste it here, and set mfrow to one row, and I want to have three columns. And for cex.main, I'm going to make it one and leave the rest as it is. So if I run it, you can see that we got the title on the top, pollution by cities. And I'm going to copy all of this again, and this time we'll change some values, making the title growth in economies and changing the vector to y, and then also mentioning it here. So now if I run it, you can see that we got another plot here. Copying it again, now changing it to z, and colors from 10 to 15. Running it. If I zoom in, you can see that we got this nice plot here. So pollution by cities, growth in economies, and growth in economies again; we can make this one population. And of course we have to start again, because par mfrow of one comma three will have only three plots in one row. So first running this pie chart, then running this pie chart, and then I can run this pie chart. If I can show you. So this is our new plot. So just like this.
You can add bar plots in the same figure. For example, you can set this par mfrow equal to 2 comma 3 and cex.main equal to one, and run it. And the next thing is, you can use barplot, in which you can pass our already declared x, y and z, and the names argument will be equal to labels. If I run it, you can see this. Now for y, now for z. And we have two rows and three columns. So if I copy all of this again and run all of them, see. Now, it's a great representation for the places where you have to show two or more different plots on the same screen. And of course I can adjust the names and labels, because I think the fourth one was Singapore and there's not much space over here. That's why you don't see Singapore. Anyway, I would conclude this lecture by saying that pie charts are attractive but not recommended, because it is very difficult for the human eye to judge the difference between slices as compared to bars. So if you just look at these bars, you can easily tell that Beijing has the highest number and Singapore has the lowest number. If you go for a pie chart, it will be very difficult for the human eye to feel that difference between the slices as compared to bars. So try to avoid using pie charts. But in scenarios where the difference is very extreme, you can of course use pie charts. Other than that, always use bar plots.
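The multi-figure layout from this movie can be sketched as below. The numbers in x, y and z and two of the four city names are hypothetical, invented only so the sketch runs; the lecture's actual vectors are not given in the transcript. Only Beijing (highest) and Singapore (lowest, fourth label) are taken from the lecture.

```r
# Hypothetical data: the real values from the lecture are not known.
x <- c(21, 62, 35, 10)    # "pollution"
y <- c(30, 45, 15, 25)    # "growth in economies"
z <- c(40, 90, 55, 20)    # "population"
labels <- c("London", "Beijing", "Tokyo", "Singapore")

# Three pie charts in one row.
old_par <- par(mfrow = c(1, 3), cex.main = 1)
pie(x, labels = labels, main = "Pollution by cities",
    col = heat.colors(length(x)),
    init.angle = 90, clockwise = TRUE)     # start at 12 o'clock, go clockwise
pie(y, labels = labels, main = "Growth in economies",
    col = topo.colors(length(y)))
pie(z, labels = labels, main = "Population",
    col = rainbow(length(z)))

# Two rows and three columns, filled with bar plots of the same vectors.
par(mfrow = c(2, 3), cex.main = 1)
barplot(x, names.arg = labels, main = "Pollution by cities")
barplot(y, names.arg = labels, main = "Growth in economies")
barplot(z, names.arg = labels, main = "Population")
par(old_par)                               # restore the previous settings
```

Comparing the first pie chart with the first bar plot shows the point made above: the highest and lowest cities are much easier to pick out from the bars.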
14. 5 Histograms: Hi, welcome again. In this movie, we will be looking at histograms. Some histogram examples were already described in the previous chapter on data distribution. For example, I showed you how you can plot a histogram for 1000 random normally distributed numbers with this color. But in this movie, we will work with histograms by using some real-world datasets. I have already included this library datasets. You can always go to the packages and check it or uncheck it from here. If you go to this package, you will find the earlier dataset which we used, Orange. So first let's work with some of the characteristics of this Orange dataset. So it has three columns: Tree, age and circumference. So I'm going to use hist, in which I refer to this data, plot the circumference and give it some nice color, running it. So this is the histogram of Orange circumferences. Now you can change the background of your histogram bars. For example, if I copy all of this, paste it here, and add some density to it. Let's make it 20. Running it. Now you can see that the pattern has changed. You can increase or decrease the density to see a different background. Now, these breaks which you are seeing over here, these are controlled by R. For example, we have bins from 50 to 100, 100 to 150, and 150 to 200. But if you want to use your own ranges, you can provide an additional argument. For example, if I copy this histogram and remove this density, instead I'm going to use another parameter, breaks, in which I can pass a vector. So 0 comma 100 comma, let's skip 150, 200. And then I can pass another number, which can be the maximum of Orange circumference. If I make it a bit bigger. Now if I run it, you can see that we have these breaks, 0 to 100, 100 to 200, and then this is the maximum circumference. If I copy it again, paste it right here, and make the breaks a bit bigger. Let's say 500, 1000, 1500.
First let's see the maximum circumference. So just selecting it, running it. So the maximum is 214. If I run it, you can see that, as the maximum is only up to 214, after 214 we don't have any data at this place at all. Right? So be careful when you are using your breaks. Get the overall idea of your data. Now, instead of specifying your own breaks, you may want to use the quantile values. For example, if I copy it one more time, and instead of breaks equal to some vector, I use this fivenum function, which gives the minimum, first quartile, median, third quartile and maximum. So Orange dollar circumference. And first let me show you what it gives. Running it, you can see that the circumference minimum is 30, the first quartile is 65.5, the median is 115, the third quartile is 161.5, and this is the maximum number, 214. You can also confirm it from here. So this is a very handy function, fivenum. So now if I run this line 15, you can see that we have made our breaks according to the quartiles. So in this dataset example, I just wanted to show you some of the parameters of the histogram and the impact of using different breaks on your histogram. Now let me show you another dataset example. So this time the dataset I'm going to use is LakeHuron, which you can find in the datasets package. And if you go to the L section, this dataset is right here. So it has the level of Lake Huron from 1875 to 1972. So first let's see the head of this dataset. So we got just one column. If we view this dataset, you can see that we only have one column, which has the level of our lake from 1875 to 1972. Now, how can you show this type of dataset graphically? Of course, with histograms. So I'm going to add a couple of spaces here. Now, first things first, let's start with a simple thing. So hist of LakeHuron, running it.
So you can see that most of the time the lake level is between 578 and 580. That's why we have more values over here. And as there is only one column, we cannot say in which years these more frequent or highest levels happened. So how can we make this histogram attractive for our presentation? Let's modify this histogram. Starting with the simple one which we have already seen, LakeHuron. And now I will add breaks, making it five bins, and running it; then setting freq to FALSE, giving it some color with col, and giving it a title. So if I just run it now. As for these breaks, let's modify the number of breaks. So I'm going to use seq from 0 to 7000 by 100, and running it. So this one doesn't work, right? With breaks as a vector of these numbers, this one will not work at all. So if I just make my breaks, let's say, 10. This looks better. If I make it 11, perfect. So around 11 bins is what I recommend for most of your histograms. And it looks somewhat normal, but it's not a normal dataset, because it's not a bell curve, right? There are more arguments which you can add over here. You can add border; let's make it 10. And you can also add density to it, so five. Now this freq equal to FALSE plots densities instead of counts, which is what we need in order to draw a normal distribution on top. If I comment it out with a hash here, I can instead set probability equal to TRUE, which means the same thing. Now, right on top of this histogram, I'm going to add the density curve. So curve. And I'm going to use dnorm, passing x, with the mean of our LakeHuron and the standard deviation of our lake. And let's give it a color, black, and a line width of two. If I run it, I need to correct this black. Okay, so first drawing the histogram, then this curve. Okay, so the reason this doesn't work is that we need this parameter add equal to TRUE, so that this curve is added on top of the histogram. So first running this histogram and then adding this curve.
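The histogram and normal curve just described might look like the sketch below. The fill color and title are assumptions; `breaks = 11` is only a suggestion to `hist()`, which rounds to "pretty" cut points, so the actual number of bars can differ.

```r
# Histogram on the density scale, so a density curve can be overlaid
h <- hist(LakeHuron, breaks = 11, freq = FALSE,
          col = "lightblue", main = "Level of Lake Huron, 1875 to 1972",
          xlab = "Level (feet)")

# Normal curve with the sample mean and standard deviation;
# add = TRUE draws on top of the existing histogram instead of a new plot
curve(dnorm(x, mean = mean(LakeHuron), sd = sd(LakeHuron)),
      col = "black", lwd = 2, add = TRUE)
```

With `freq = FALSE` the bar areas sum to one, which is why the `dnorm` curve and the bars share the same vertical scale.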
So if the data were normally distributed, this is how it would be distributed. Now I'm going to add kernel density lines. For that, I can use this function lines and call the density function, passing the lake and giving it some color, and running it. So these are the kernel density lines of the levels. The black one is the normal curve, and the green line is the kernel density line. And if you want to adjust these kernel density lines, you can add an additional parameter, right in that density call, and that is adjust. I'm going to use the color darkgoldenrod1 and run it. So after adjusting the density, this is how our Lake Huron density will look. Now if you want to show the data points over here, you can use the function rug. So rug, LakeHuron. And you can give it some color, let's say red. Running it. Now you can see the red lines over here. So we used a very simple dataset, LakeHuron. First we adjusted the histogram, then we drew our normal curve. Then we showed the kernel density line and adjusted it. Then I showed the data points with rug. And so with a bit of coding and a few parameters, I have transformed my earlier version of the Lake Huron histogram into something much more promising. I will end this lecture right here, and I will give you one challenge to do. For you, we have this dataset, USArrests. If you go to the datasets package, you can find this dataset USArrests over here. It has several columns, but what I'm interested in is this urban population column, UrbanPop. So in the USArrests dataset, this urban population column just holds numbers. Now, I want you to draw a histogram of this urban population column and add an appropriate number of breaks. And remember I told you that most of the time, ten to 15 bins are enough. After drawing your histogram, you should add the normal curve like this.
And then draw the kernel density line, after that adjust it, and use this rug function on this column. And in the next movie, I will show you my solution. So pause this movie, go to R, and work with this dataset. It's very easy. And I will see you in the next solution video.
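For reference, the kernel density and rug steps from the Lake Huron example could be sketched like this. The `adjust` value of 3 is an assumption (the lecture does not state the value used); `adjust` simply multiplies the bandwidth that `density()` would otherwise choose.

```r
hist(LakeHuron, breaks = 11, freq = FALSE, col = "lightblue",
     main = "Level of Lake Huron, 1875 to 1972", xlab = "Level (feet)")

# Default kernel density estimate
d_default <- density(LakeHuron)
lines(d_default, col = "green")

# Smoother estimate: adjust = 3 triples the default bandwidth (assumed value)
d_smooth <- density(LakeHuron, adjust = 3)
lines(d_smooth, col = "darkgoldenrod1", lwd = 2)

# One tick per observation along the x-axis
rug(LakeHuron, col = "red")
```

Larger `adjust` values give a smoother, flatter line; values below one follow the individual bars more closely.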
15. 6 Understanding Urban Population of US using Histogram: Hi, welcome to this solution. I hope you managed to complete that task. Now I'm going to show you my solution. We'll be working with this dataset, USArrests. So I'm going to copy all of this code from the previous movie, up to the basic histogram of LakeHuron, copying it, creating a new file, and pasting it right here. Let's move this exercise for the USArrests data to the top. So first requiring my package datasets, now loading my dataset. This is the column we will be working with. So copying this column. And first let's see the basic histogram, running it. So this is the histogram for the urban population of USArrests. Now for modifying this, pasting the name here. For the number of breaks, I will go with 10, and for colors, I will pick this color this time. So the title is histogram of urban population, making the border 0 and commenting out density. Now if I run my histogram, this is the histogram. The next thing is drawing the curve, so pasting the column name here for the mean and for the standard deviation, because we want to draw a normal curve; the color will be black. You can see this normal curve. Now, drawing the kernel density line with the same color. Then on the next line, adjusting it. And of course, mentioning the column here. On the next line, adding the rug. Sweet. Now let's save this file as understanding urban population of US using histogram.
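One way to package the solution steps is a small helper. The function `hist_with_curves()` is hypothetical (it is not part of R or of the lecture's script), and the colors and `adjust = 2` are assumed values; it is only a sketch of the histogram, normal curve, kernel density and rug sequence described above.

```r
# Hypothetical helper: histogram + normal curve + kernel densities + rug
hist_with_curves <- function(vals, breaks = 10, adjust = 2, ...) {
  m <- mean(vals)
  s <- sd(vals)
  h <- hist(vals, breaks = breaks, freq = FALSE, ...)
  curve(dnorm(x, mean = m, sd = s), col = "black", lwd = 2, add = TRUE)
  lines(density(vals), col = "green")
  lines(density(vals, adjust = adjust), col = "darkgoldenrod1", lwd = 2)
  rug(vals, col = "red")
  invisible(h)  # return the histogram object without printing it
}

h <- hist_with_curves(USArrests$UrbanPop, breaks = 10,
                      col = "steelblue", border = 0,
                      main = "Histogram of urban population",
                      xlab = "Urban population (%)")
```

The argument is named `vals` rather than `x` on purpose, because `curve()` reserves `x` for the variable it evaluates the expression over.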
16. 7 Box Plots: Starting boxplots. You can use boxplots to show quantitative variables. And boxplots can be used for one variable or for groups of variables. In this movie, I will show you how to work with a one-variable boxplot. So including my datasets. And this time I'll be using this iris dataset right here in the datasets package. It has a couple of columns, and this dataset is well known in the data science community. If I show you a couple of rows of this dataset, we got four numerical columns and one categorical column. So if I just ask for the summary of, let's say, one of the columns, sepal length, I can say iris, then dollar, Sepal.Length. Now you can see that we got the minimum, first quartile, median, mean, third quartile and maximum. And as a quick reminder from the last section, if the mean, median and mode are equal, the data are symmetric, as they would be for a normal distribution. Anyway, a boxplot is built on a five-number summary, and these numbers are used to create the box plot. You can also get these numbers by using the fivenum function, which I showed you in the previous movie. So iris, dollar, Sepal.Length, running it: exactly as above. Now, I can show all of this without using the summary or fivenum function. I can directly use boxplot on iris and run it. So you can see that we got a couple of columns, but what I'm interested in in this movie is only sepal length. You can also add a couple of arguments here. But before doing that, let's first understand this boxplot. So this lower line is the minimum. The box spans the first quartile to the third quartile, and this upper line is the fourth quartile, or the maximum. Let me quickly show you one more explanation. So I'm going to show all of the columns, running it. Okay, now we got some dots here. So this bottom line represents the minimum. This box runs from the first to the third quartile. And this dark line inside the box is the median. This top line is the fourth quartile, or the maximum.
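The link between the five-number summary and the boxplot can be checked directly. In this sketch the titles and variable names are assumptions; `boxplot()` returns the statistics it draws, and when there are points beyond 1.5 times the box height the whiskers stop at the last non-outlying value rather than at the true minimum or maximum.

```r
x <- iris$Sepal.Length

summary(x)   # min, Q1, median, mean, Q3, max
fivenum(x)   # min, lower hinge, median, upper hinge, max

# boxplot() returns the numbers it draws: whisker ends, hinges and median
b <- boxplot(x, main = "Iris sepal length", ylab = "Sepal length (cm)")
b$stats

# Sepal width does have points beyond the whiskers, drawn as circles
w <- boxplot.stats(iris$Sepal.Width)
w$out
```

For sepal length there are no outliers, so the whisker ends coincide with the minimum and maximum and `b$stats` matches `fivenum(x)`.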
And these circles right here are the outliers. So the circles above the top line and below the bottom line are the outliers. Now, moving back to our simple boxplot with just one column, let's add a couple of arguments. First, copying this. Now starting to add arguments after the comma. So I can make it horizontal equal to TRUE, running it. So removing this, and running it. Now the boxplot is in horizontal form. Now adding some color: col equal to colors, ten to 15. So this one doesn't have an outlier, and I wanted to show you how you can add color to the outliers. So let's go for sepal width instead and run it. So this is our box plot. Now, let's add color to our outliers. Or first, let's add symbols. For the outlier symbol, it would be outpch equal to 16, which is used for filled circles. And if you want to give it some color, you can use outcol equal to gray. Sweet. Now, let's make it vertical. And if I run it, let's make the axis labels horizontal instead, so las is equal to one. Running it. Now, specifying the range for the y-axis with ylim. Instead, let's comment it out. Now, if you want to change the whisker lines, you can use whisklty. So adding a comma. Now you can see that we got the line over here. Now if you want to add notches at the median, you can use notch equal to TRUE, like this. You can add one more parameter, boxwex, which sets the width of the box as a proportion of the original. Seems tight. Lastly, adding the main heading and the xlab, and running it. Great. Now you can compare the width for different species. If I can show you real quick, inside the iris dataset we have the Species column, which has three types: setosa, versicolor and virginica. So the next thing is I'm going to add a boxplot of the iris sepal width as a function of species. So first mentioning Species, giving it some color, and main, ylab and xlab. If I run it.
Now you can see that we got the sepal width comparison for every single species. So instead of length, it should say width, because we are using width, not length. Great. Now lastly, I will copy this earlier boxplot. And instead of showing it for one column, I will show it for all of the columns. So just passing my iris dataset, running it. Let's change the main title. Good. Now let's also remove these lines. So I will add one more parameter here, staplelty. It will remove the lines at the ends, the minimum and maximum lines. So this was a great example of using boxplots to represent your data. In the next movie, I will show you how to use boxplots to represent your groups.
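The boxplot arguments walked through in this lecture can be combined as in the sketch below. The specific colors, the `boxwex` value and the titles are assumptions; `outpch`, `outcol`, `whisklty` and `staplelty` are passed through `boxplot()` to the underlying `bxp()` drawing function.

```r
# One variable, with the styling arguments from the lecture
boxplot(iris$Sepal.Width,
        horizontal = TRUE,      # lay the box on its side
        col = "lightblue",
        outpch = 16,            # filled circles for outliers
        outcol = "gray",        # outlier color
        las = 1,                # axis labels always horizontal
        notch = TRUE,           # notch around the median
        boxwex = 0.5,           # box width as a proportion of the default
        whisklty = 1,           # solid whisker lines
        staplelty = 0,          # hide the end caps of the whiskers
        main = "Iris sepal width", xlab = "Sepal width (cm)")

# One box per group: sepal width as a function of species
b <- boxplot(Sepal.Width ~ Species, data = iris,
             col = c("tomato", "gold", "skyblue"),
             main = "Sepal width by species",
             xlab = "Species", ylab = "Sepal width (cm)")
```

The formula interface `y ~ group` is what turns a single box into a per-group comparison, which is the subject of the next lecture.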
17. 8 Box plots for groups: Learning boxplots for groups. So I'm in this file, box plots for groups, and the dataset which I will be using is from the datasets package. If you go here and just search for chick, this is the dataset chickwts, which has a couple of columns. First loading my dataset and checking the first couple of rows. So we have two columns, weight and feed. We have already worked with this dataset earlier, but this time we will use it in box plots. So let's see the levels of this dataset, going to the feed column. So casein, horsebean, linseed, meatmeal, soybean and sunflower are the levels of the feed column, which records which feed was given to each chick. And the purpose of this dataset is to compare the effectiveness of various feeds on the growth rate of chickens. So if I boxplot this chickwts data, I will boxplot weight against feed. That's why I use this tilde symbol, to plot weight against feed. If I run it, this is the basic boxplot. Let's add a couple of parameters. So I will start with col, running it. Now let's also give it names. So names equal to this vector. If I run it, nothing really happens over here. If I zoom in. And first let's remove this. So even though it extracts the name of every feed automatically, it's a good practice to give names when you are showing group data in your boxplot. Now, let me add some more parameters here. Okay, so I have added the main title, xlab, ylab, and a couple more graphical parameters. So if I run it, you can see this boxplot. And it's obvious from the plot that horsebean was less effective compared to casein. So it's a better representation of group data than bar plots. Because remember, at that time we had to organize our data, and we also took an aggregate between the weight and feed columns. Then we used transpose and column names to assign the feed names. But in this boxplot, we just fed our data to this function.
And the rest of the job was done by the boxplot function. So that's how you show your group data using boxplots. And after using a package, always detach your package and always clear your workspace.
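A grouped boxplot along the lines of this lecture might look like the following. The color, titles and variable names are assumptions; the `names` argument is given explicitly, as recommended in the lecture, even though `boxplot()` would pick up the factor levels on its own.

```r
feeds <- levels(chickwts$feed)
feeds

# weight as a function of feed: one box per feed type
boxplot(weight ~ feed, data = chickwts,
        col = "khaki",
        names = feeds,          # explicit group labels
        las = 2,                # rotate labels so all six fit
        main = "Chick weight by feed type",
        xlab = "", ylab = "Weight (g)")

# The numbers behind the visual comparison
med <- tapply(chickwts$weight, chickwts$feed, median)
med
```

The per-group medians make the visual impression checkable: the horsebean box sits clearly below the casein box.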
18. 9 Scatter Plots: Hi, I hope you are having a great time learning R. In this movie, I will show you how to create scatter plots. They are suited for quantitative variables. And the simplest form of scatter plot uses the default plot function. For this movie, I have picked a couple of datasets to work with. So if I show you my directory, in the same folder, Section seven, plots and charts, I have this file, student before after teaching intervention dot csv. The idea behind this file is that there are a number of students, from one to 20. This was their progress to date before the teaching intervention, and after getting help from the teacher, these were the progress scores which they got. So now I'm going to use this simple dataset to create scatter plots. Closing this file, opening R. Right here, I will create a variable for the teaching intervention data, which will be equal to read.csv, mentioning my file name dot csv. If I run it, of course I get an error, because I have to set my working directory to the source file location, where that file is located. Now if I run it, no error. If I show you the head of this file: Student, Before, After, and we got this anomaly at the beginning of the Student column name. If I open the file again, we can see that it doesn't have the ï and dot dot. So at the time this file was made, certain practices were not adopted when creating the CSV file. That's why every now and then, when you are working with a dataset, for example one you downloaded from Kaggle, you will find this anomaly when you load it. The easiest way to correct this anomaly is by adding an additional parameter, fileEncoding equal to UTF dash 8 dash BOM, in double quotes. So now if I run it, and if I show you the head, the head of the file, of course, there is no problem here anymore. Let's plot this, starting with the teaching intervention data.
Then the Before column, and from the same data the After column. If I run it, this is the plot of the dataset. There's also another notation. You can use the plot function and pass the After column against the Before column directly, and then mention the data name. If I run it, same result. Now let's add a couple of parameters. So pch equal to 2 and col equal to red. So this is the scatter plot. For every single data point, it has linked the value on the x-axis with the value on the y-axis and put a triangle, set by this pch parameter, at that location. So you can notice that for this specific student, before the teaching intervention his progress was around 12, and after the teaching intervention it went up to 18. Now, there is one more dataset which I want to show you, in the same directory: this dataset, pulse before after exercise. It has numerous columns, but what I'm interested in is Pulse1 and Pulse2. It has different indicators to measure the effectiveness of exercise, but we will be focusing on Pulse1 and Pulse2. So let's copy this file name. There is no directory problem, because we selected our directory earlier. But if I show you the pulse data, you can see a problem in the first column name. So copying this fileEncoding parameter, pasting it right here, and running it. Now if I run it, no problem. Now for this dataset, same thing. I can plot these pulses, so Pulse2 against Pulse1, and the data would be the pulse data. So when you are plotting one column against a second column, the y variable, in our case Pulse2, goes at the beginning. If I run it, you can see this scatterplot. Scatterplots also tell you in which numerical area most of your values are located. So from this plot we can see a cluster from 60 to 100 on the x-axis and from 60 to 80, or 100, on the y-axis. Most of our values are in between here. Now, that was for when we have our own data to make scatterplots. Now let me show you the iris data. So going to the datasets package, this iris dataset: library datasets, data iris.
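The byte-order-mark problem with the CSV files can be reproduced without the course files. In this sketch the file is a temporary one and the three rows are made-up toy values, not the lecture's data; only the column names Student, Before and After follow the transcript.

```r
# Write a tiny CSV that starts with a UTF-8 byte order mark (EF BB BF),
# the kind of file some spreadsheet tools produce. Toy values only.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("Student,Before,After\n1,12,18\n2,10,15\n3,14,16\n"), con)
close(con)

# Without fileEncoding, the first column name typically comes out garbled
# (the exact form depends on platform and locale)
names(read.csv(tmp))

# Telling R about the BOM gives clean names
teach <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(teach)

# Formula notation: y ~ x, so After goes first
plot(After ~ Before, data = teach, pch = 2, col = "red",
     main = "Progress before and after teaching intervention")
```

The same `fileEncoding` argument applies to any `read.csv()` call, which is why it could be copied over to the pulse file unchanged.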
Now plotting this data, two of its columns: sepal length against sepal width. If I run it, this is the scatter plot. Let's make a modified version of the scatterplot, so adding pch. If I run it, you can see a very good-looking scatter plot. You can add more parameters if you go back to the earlier graphical parameter lectures. You can also fit a linear regression model on top of this data. For example, if I set a linear model variable equal to lm, for the simplest linear regression model. For the y variable, I will use the width column, and for the x variable, I will use length. If I run it, it's giving me an error, because we are modelling width against length, so we need to use the tilde symbol. If I run it. Now if I show you the summary of my linear model, you can see that we got the intercept value and the slope, their t-values, p-values, standard errors, level of significance, the F statistic and the R-squared value. We'll discuss all of these in our next chapter. But for now, let's stick with plots and charts. So if you want to draw the regression line, you can use the function abline and pass your linear model. You can also give it some color, let's say red, and a line width. If I run it, this is the regression line on the plot. Now if you want a line which is locally weighted, if you want to smooth this line, you can use the function lines and pass lowess, which is used to add locally weighted scatterplot smoothing. In here, first mentioning my x-axis, which was sepal length, and then my y-axis, which was sepal width. I'm not using the tilde this time, but a comma. And let's give it the color blue and a line width of one. If I run it. So the red line was the regression line, and this blue line is the locally weighted scatterplot smoothing. You can think of the straight line as the linear regression line, while the blue line follows the pattern of the scatterplot.
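The scatter plot with a regression line and a lowess smoother can be sketched as follows. The object name `fit`, the titles and the line widths are assumptions made for this example.

```r
# Scatter plot: width (y) against length (x)
plot(Sepal.Width ~ Sepal.Length, data = iris, pch = 16,
     main = "Iris sepal width against sepal length",
     xlab = "Sepal length (cm)", ylab = "Sepal width (cm)")

# Simple linear regression; the tilde reads "is modelled by"
fit <- lm(Sepal.Width ~ Sepal.Length, data = iris)
summary(fit)

# Straight regression line from the fitted model
abline(fit, col = "red", lwd = 2)

# Locally weighted smoother: lowess() takes x and y separated by a comma
lines(lowess(iris$Sepal.Length, iris$Sepal.Width), col = "blue", lwd = 1)
```

Note the two notations: `lm()` and the formula form of `plot()` use `y ~ x`, while `lowess()` takes `x, y` as separate arguments.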
Now, instead of adding this locally weighted scatterplot smoothing line yourself, you can use the car package to do all the work for you. You can find more information about this package if you run help, package equal to car. This package has many statistical functions, like Anova. And you can also build special types of scatter plots with locally smoothed lines. If you don't have this package, you can install it by running this line. I have already installed it, so let's just include the package. Now, I will show you one of the scatter plot functions from this package. It has many scatter plot functions, but let's work just with this one, scatterplot. My dataset would be iris, sepal width against sepal length, with pch equal to 16 and the color dark blue. And for main, ylab and xlab, I'm going to copy this. If I run it. Now, the scatterplot function from the car package has created marginal boxplots for sepal width and sepal length. You can also see this smoother, which runs alongside the regression line, and this dark blue line is our regression line. The top and bottom lines represent the quantile regression intervals. So by using this simple scatterplot function from the car package, we have managed to create a very comprehensive scatter plot with the regression line, the smoothed weighted line and quantile regression intervals, as well as box plots. You can find more information about this package if you click on the package name. Here you will find plenty of information about the different functions. So scatterplots are a great way to find links and clusters in your quantitative variables.
19. 10 Mat Plots: Hi. This matplot is not widely used, but it's a great plot for showing multiple columns or plots in one figure. If I show you real quick the documentation of matplot, you can find useful information about this plot: what you can do and what kind of parameters are available, like pch, bg, cex, xlab, ylab, and so on. So let's start with a simple example. I will create some sample matrices over here. So matrix1 equal to matrix, with 12 random normally distributed numbers in four rows and three columns. The next matrix is a sample of size 12 from minus one to one, with replace equal to TRUE, and the same number of rows and columns. Running it. Let's see our matrices. Now, let's plot these matrices using our matplot function. So matrix1, matrix2. It will plot the first column of matrix one against the first column of matrix two, then the second column of matrix one against the second column of matrix two, and the third column of matrix one against the third column of matrix two. If I run it, you can see that we got this plot. So 1 represents the first column, 2 represents the second column, and 3 represents the third column. I can also modify it, so changing the type to o. Now you can see that it has connected all the points of the matrices. If I choose another type, let's say l, there are no more symbols, just the lines. There are other types available: you can see type s, type b and type h. I can also add a couple more parameters to this matplot, for example a line width equal to 10. So this was just a random example to show you how you can use different parameters in matplot. Let's work with our dataset. So attaching my datasets, and the same data, iris. If I show you the head of iris: four numerical columns and one categorical column. So if I quickly create a matplot out of this, let's say I want the iris columns 1 to 4.
So basically these are the numerical columns, and the type would be b. If I run it, this is the plot. Next I add the parameter pch, 1 to 4, to have four different shapes for our points. Now for colors, let's go with four colors, the default colors from one to four, and give it a main title. And the last thing, I'm going to give it a legend. The legend position will be the top left, and the legend text will be a vector of names, which I'm taking from the dataset. It will be on the next line, and this will be our vector of names. And right here, I need to specify pch and col. If I run it now, perfect. Now, the same thing which we did with matplot you can also do with the graphical parameter function par, to show different plots in the same figure. For example, in the case of this iris data, if I set par, mfrow, to two rows and two columns. If I run it, and on the next line I plot my iris data columns 1 to 2, and plot iris columns 2 to 3. That's how you can write multiple statements on the same line in R, by just putting a semicolon after each statement. If I run it, you can see these two plots. If I copy all of this again and paste it, changing it to three to four and to four to one, and run it again. So now you can see multiple plots on the same screen. You can achieve all of this with just one statement, plot iris. But before running this 35th line, let's set the screen back to only one row and one column. So if I run this, and now if I run my plot, see. So there are multiple ways to plot your data in R. It depends upon your needs: what do you want to plot, and what kind of data do you have? So before performing any high-level analysis, always try to plot your data to get the overall idea.
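The matplot and par(mfrow) steps from this lecture could be sketched as below. The seed, the use of `sample(-1:1, ...)` for the second matrix, the titles and the choice of column names as legend text are all assumptions made for illustration.

```r
set.seed(1)  # assumed seed, only so the example is repeatable

# Two 4 x 3 matrices: random normal values, and values sampled from -1..1
matrix1 <- matrix(rnorm(12), nrow = 4, ncol = 3)
matrix2 <- matrix(sample(-1:1, 12, replace = TRUE), nrow = 4, ncol = 3)

# Column k of matrix1 is plotted against column k of matrix2
matplot(matrix1, matrix2, type = "o", lwd = 2)

# The four numeric iris columns in one figure
num_cols <- as.matrix(iris[, 1:4])
matplot(num_cols, type = "b", pch = 1:4, col = 1:4,
        main = "Iris measurements", xlab = "Row", ylab = "cm")
legend("topleft", legend = colnames(num_cols), pch = 1:4, col = 1:4)

# Alternative: a 2 x 2 grid of separate plots, then restore the old layout
op <- par(mfrow = c(2, 2))
plot(iris[, 1:2]); plot(iris[, 2:3])
plot(iris[, 3:4]); plot(iris[, c(4, 1)])
par(op)

# A scatterplot matrix of every pair of columns in one call
plot(iris)
```

Saving the old settings in `op` and restoring them with `par(op)` avoids having to remember to set `mfrow` back to one row and one column by hand.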
20. 1 statistical tests: Hi, welcome to this new chapter on statistical tests and applications. I am very excited about teaching this section, because when I was learning statistical modelling in R, I kept finding a gap between the concepts and their statistical applications. It won't be the same for you. I have tried to make this section as easy as it can be, so that you can perform statistical tests in R with confidence. If we divide our statistical tests, we end up with two types: parametric tests and non-parametric tests. For interval scales, such as IQ level or room temperature, and ratio scales, such as age, height, weight, sales or years of education, we use parametric statistics. And for nominal scales, such as male or female, and ordinal scales, in which we have some sort of ranks like first, second, third, we use non-parametric statistics. Parametric tests rely on assumptions about the distribution of the population and are specific to population parameters, while non-parametric tests make very few assumptions or none at all. For a parametric test, the data should follow a normal distribution, which you can also call a Gaussian distribution. For non-parametric tests, the data need not follow a normal distribution. Some examples of parametric tests are correlation and regression, t-tests and analysis of variance, which assume certain characteristics about the data, whereas Spearman rank correlation and the Wilcoxon rank test fall in the non-parametric area. So most of the time you will be working with parametric tests. Not just most, almost all of the time; maybe one time in a hundred you will have to go for non-parametric tests. In parametric tests such as regression, t-test, ANOVA and chi-square, we need to test whether something is statistically significant or not. Statistical significance is the determination that a relationship between two or more variables is caused by something other than chance.
And we decide whether something is statistically significant or not by this rule of thumb: if the p-value is less than the significance level, we reject H naught and accept H1. But the question is, what is the 5 percent level? And on top of that, what is H naught, what is H1, and what on earth is this p-value? Let's look at each concept piece by piece. So first things first, the significance level. It is denoted by alpha, and it is the probability of rejecting the null hypothesis when it is actually true. Most of the time, we use a significance level of 5%. In cases where you can bear some mistakes, you might choose the 10 percent level. But if we are working under very strict conditions, for example when we need to test whether a machine is working properly, we would expect the test to make few or no mistakes. As we want to be precise, we should pick a low significance level such as 0.01, which is 1%. In most examples you will see statisticians picking the 5% significance level. Let's dive deeper. For example, you can pour a certain amount of water into this bucket, let's say 10 liters or maybe 12. If you go further than that limit, what's going to happen? Of course, water will be spilled. So in certain situations, such as a nuclear power plant or the spillway limit of a dam, we need to be as accurate as possible. In other real-world scenarios, we would expect more random or at least uncertain behavior, and hence tolerate a higher degree of error. So the significance level acts like a limit. If you cross that limit, there might be some danger in the results. So we take the significance level as a limit indicator in hypothesis testing. Now, what is hypothesis testing? As you are already familiar, we take samples from populations, and a sample is like a random slice of data drawn from the population, which represents the population. In statistics, we try to make an inference about some scenario. For example: in general, the board color, white or black, has an impact on kids' learning.
So the complete set of all whiteboards and blackboards is our population of interest. We will apply some sort of test on our samples, and after performing that test, we will generalize the result to the whole population of interest. But how do we test this? This is where hypothesis testing can help you out. So H naught is the null hypothesis, which could be that two things are equal, and the alternative hypothesis, H1, is that the two things are different. In our board example, this was the inference to test. So H naught could be that the learning outcomes with both types of boards are equal, and H1, our alternative hypothesis, would be otherwise, which is that the learning outcomes with the two types of boards are different. So these are our very simple null and alternative hypotheses. But what makes us decide which hypothesis we should reject? Let's find the answer. As you already know, if we have data which follow the standard normal distribution, we will have this bell curve, with negative values on the left side and positive values on the right side. This bell curve has a mean of 0 and a standard deviation of one. This area, within three standard deviations of the mean, is 99.7% of our data. This area, within two, is 95.5%, and this one, within one, is 68.3%. And these two halves are completely equal. Having equal halves is the fundamental characteristic of the normal distribution. So if our data are normally distributed and standardized, they will have a mean of 0, which will be the true population mean. We already know that from the chapter on data distribution. Now coming back to our whiteboard example, assume that the population of whiteboard usage is normally distributed like this, and that all the data points, the satisfaction levels received from the classes, look the way shown. So our whiteboard data will have a mean of 0, which is the true population mean. But how can we decide whether the hypothesis is true or false by looking just at this bell curve? Well, the z-test can help.
So in the z-test, we have this term, z-score, which tells us how many standard deviations a data point is from the mean. After finding a z-score, we go to the z table, which tells us what percentage of the area lies under the curve up to any particular point. This z-score and the z table help us do hypothesis testing. This is the formula for the z-score. With today's software, we don't have to calculate the z-score for every data point by hand; the software can do all of this work. The idea behind the z-test is that we standardize, or scale, the sample mean which we got. If the sample mean is close to the hypothesized mean, then Z will be close to 0. Otherwise, it will be far away from it. Naturally, if the sample mean is exactly equal to the hypothesized mean, Z will be 0. In all these cases, we would not reject the null hypothesis. But as I said earlier, in R and other modern software, you don't have to calculate your z-scores by hand to tell which hypothesis holds. In the case of normally distributed data, we will have our data like this. This blue part represents 95 percent of the data, and you can call it the 95 percent confidence interval. This black part is the remaining 5%: 2.5% on the right side and 2.5% on the left side. So if we have a 95 percent confidence interval, we have a significance level of 5%. When we, or our software, calculate Z, we will get some value. If this value falls into the middle part, we cannot reject the null hypothesis. And if it falls outside, in the black region, we can reject the null hypothesis. Now, if you want to find out how to look things up in the z table, even though there is no need, you can visit this link to learn more. I don't think you need it, but feel free to check the link. So this black area is the rejection region. Now we have the z-score and the z table. But what does the rejection region depend on? The area that is cut off actually depends on the significance level.
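The transcript does not reproduce the slide's formula, so the sketch below assumes the usual one-sample form, z = (sample mean - hypothesized mean) / (sigma / sqrt(n)). The function name `z_score()` and all numbers are hypothetical toy values chosen only to show the mechanics of a two-tailed decision at the 5% level.

```r
# Hypothetical helper: standardize a sample mean against a hypothesized mean
z_score <- function(sample_mean, mu0, sigma, n) {
  (sample_mean - mu0) / (sigma / sqrt(n))
}

# Toy numbers: sample mean 52, hypothesized mean 50,
# known population sd 10, sample size 100
z <- z_score(sample_mean = 52, mu0 = 50, sigma = 10, n = 100)

alpha  <- 0.05
z_crit <- qnorm(1 - alpha / 2)   # cut-off for a two-tailed test, about 1.96

reject  <- abs(z) > z_crit       # TRUE if z falls in either rejection region
p_value <- 2 * pnorm(-abs(z))    # two-tailed p-value for the same z

z; z_crit; reject; p_value
```

`qnorm()` plays the role of the z table here: it returns the z value that cuts off a given area under the standard normal curve.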
Say the level of significance is 5%; then we will have 2.5% on the right and 2.5% on the left side. Now, these are the values we can check from the table. If alpha divided by two is 0.025, then Z is 1.96 from the table. So 1.96 on the right side and minus 1.96 on the left side. Therefore, if the value we get for Z from the test is lower than minus 1.96 or higher than 1.96, we will reject the null hypothesis. Otherwise, we will not reject it. Every percentage of the data has a z value associated with it. For example, this is the standard normal distribution. About 68% of the data lie within one standard deviation of the mean, so one is the z value here if our data are normally distributed. You can test it using pnorm, the cumulative normal distribution function, which we have seen in our data distribution chapter. Now, if you are considering 90% of the values, it means you have a significance level of 10 percent. You can test this as pnorm of the positive value minus pnorm of the negative value. If you are considering 95 percent of your data, it means you have a significance level of 5%. In that case, your z value would be 1.96, which you can test with this pnorm function. I will show you these in the next movie. If you are considering 99 percent of your data, then your z value will be 2.58, which means that 99 percent of values fall within 2.58 standard deviations of the mean. You can test it with pnorm of the positive value minus pnorm of the negative value. Now as you can notice, in these scenarios, where we have 68 percent, 90 percent, 95 or 99 percent of our data, we are feeding in two values. This is the two-tailed test, where we have a shaded region on both sides, because the learning outcomes are either equal or not equal. In that scenario, H naught would be that the learning outcomes with both types of boards are equal, and H1 would be that the learning outcomes with the two types of boards are different. In such a scenario, we look at both sides of the significance level.
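The pnorm checks mentioned above can be sketched in a few lines. The helper name `coverage()` is an assumption for this example; `pnorm()` and `qnorm()` are the base R functions for the normal cumulative distribution and its inverse.

```r
# Share of a standard normal distribution lying between -z and +z
coverage <- function(z) pnorm(z) - pnorm(-z)

coverage(1)      # roughly 68%
coverage(1.96)   # roughly 95%
coverage(2.58)   # roughly 99%

# Going the other way: two-tailed cut-offs for 10%, 5% and 1% significance
alpha <- c(0.10, 0.05, 0.01)
crit  <- qnorm(1 - alpha / 2)
crit
```

The 10% level, for which the lecture does not quote a z value, comes out at about 1.64 on each side.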
If the value lies in the shaded region, we will reject our H-naught. So this was the two-tailed test. Here is one example of a one-tailed test. Let's say I have this statement: the average student GPA is greater than 3.5 in the ICT department. Since this statement uses greater than, I will develop a one-tailed test. So H-naught would be that the average student GPA is greater than 3.5 in the ICT department, and H1 would be that it is less than or equal to 3.5. In such a scenario, where the statement was greater than, we look at the left side. Using the same significance level, our whole rejection region will be on the left. Now, if you are using the classical technique, we look into the table, and with the whole 5% in one tail that corresponds to a z-score of 1.645. Since it's on the left, it takes the minus sign: minus 1.645. When calculating our z statistic, if we get a value lower than minus 1.645, we reject the null hypothesis. We do that because we have statistical evidence that the student GPA is less than or equal to 3.5; otherwise, we do not reject it. Now here is another example. We have this statement: the health department says doctors earn less than 200K. This is, of course, a one-tailed test, because it's not using the language of equal or not equal. It's using greater or less, and greater or less means a one-tailed test. So our H-naught would be that doctors earn less than 200K, and H1 would be that doctors earn greater than or equal to 200K. In such scenarios, the rejection region is on the right side. So if the test statistic is bigger than the cut-off score, which will be positive 1.645, we reject the null hypothesis; otherwise we do not. Now here is another example for you. This statement: the health department says the drug has improved people's health. As a statistician, you need to test this. You have all the data. So first you have to decide whether this is a one-tailed test or a two-tailed test. It's not using any sort of equality.
That's why it will be a one-tailed test. So our H-naught would be that the drug has improved people's health, and our H1, the alternative hypothesis, would be that the drug has weakened people's health. Now, if the health department gives you a statement such as the drug has no effect, then this is a two-tailed test. Our H-naught would be that the drug has no effect, and H1 would be that the drug has an effect. So if the statement is using greater than or less than language, then that would be a one-tailed test most of the time. And if it's using equal, not equal, effect, or no effect, then that would be a two-tailed test. Here is another example. The health department says doctors earn more than 200K. So H-naught would be that doctors earn more than 200K, and H1 would be that doctors earn less than or equal to 200K. Now what if the department says doctors earn on average 150K? Then it will be a two-tailed test, because it's using the word average. So H-naught is that doctors earn 150K, and the alternative hypothesis H1 is that doctors do not earn 150K. So we have learned dozens of new concepts in this video. But is there a rule of thumb to summarize all of this? Well, congratulations, there is, which is the p-value. The p-value is the probability of getting a result at least as extreme as the one we observed if the null hypothesis were true. So I'm giving you a rule of thumb to summarize all of the complicated discussion. Whenever you do hypothesis testing, like I'm going to show you in this chapter, we will obtain this p-value. If the p-value is lower than 5%, which is 0.05, then we reject H-naught in favour of H1. And if the p-value is greater than 5%, which is 0.05, then we say there is insufficient evidence to reject H-naught. So this is the summary of all of the discussion and all of the work which we are going to do in this chapter. Remember, I'm repeating myself again: if the p-value is lower than 5 percent, we reject H-naught and go with H1.
In every statistical test in this section, we will obtain some p-value. If that p-value is lower than 5 percent, then we reject H-naught and go with H1. If that p-value is greater than 5%, which is 0.05, then we say that there is insufficient evidence to reject H-naught. Now, all of this discussion, including z-scores, hypothesis testing, p-values, the rejection region, the significance level, H-naught, and H1, is only for parametric tests. And parametric tests only hold when you have normally distributed data. A normal distribution has no skew. But if you don't have a graph, how do you figure out whether your data is normally distributed? You can check the skewness and kurtosis of the distribution, or check with the Shapiro-Wilk test. We have already seen skewness and kurtosis in the previous data distribution chapter. You can go back to that chapter and refresh the concepts if you want to. Anyway, I will conclude this lecture on the p-value, because in all the parametric statistical tests, we are going to decide whether to go with the null hypothesis or the alternative hypothesis by looking at the p-value. In the next couple of movies, we will look at the p-value of different tests and make inferences. So see you there.
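The decision rule from this lecture can be sketched in R as follows. This is only a minimal sketch: the sample figures (sample_mean, mu0, sigma, n) are made-up values for illustration and do not come from the course files.

```r
# Two-tailed z-test sketch. All sample values below are hypothetical.
alpha <- 0.05                          # significance level

# Cut-off scores for the rejection region
z_crit_two <- qnorm(1 - alpha / 2)     # two-tailed: about 1.96 (2.5% in each tail)
z_crit_one <- qnorm(1 - alpha)         # one-tailed: about 1.645 (whole 5% in one tail)

# Share of a standard normal distribution between -z_crit_two and +z_crit_two
coverage <- pnorm(z_crit_two) - pnorm(-z_crit_two)

# Hypothetical sample summary
sample_mean <- 103
mu0   <- 100    # hypothesised mean under H0
sigma <- 15     # population standard deviation, assumed known
n     <- 50

# Standardise the sample mean, then convert it to a two-tailed p-value
z <- (sample_mean - mu0) / (sigma / sqrt(n))
p_value <- 2 * pnorm(-abs(z))

# Both forms of the rule give the same decision
reject_h0 <- abs(z) > z_crit_two
c(z = z, p_value = p_value, reject_h0 = reject_h0)
```

With these made-up numbers z is about 1.41, which falls inside the middle region, so H-naught is not rejected; comparing p_value with alpha leads to the same decision.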
21. 2 Data Distribution and Simulation Finished: Hi. Remember that when we were working with Section five, data distribution and simulations, we left one file incomplete. If I show you this file, two data distribution and simulation, this was the file. We completed every data distribution except for the cumulative and quantile functions. At that time, I told you that when we reach our statistical tests section and learn about 95% confidence intervals using the z-score, this cumulative part will make much more sense. So let's not wait further and complete this file first. If I show you pnorm of 0, you can see that we got 0.5, which is 50 percent. If I show you pnorm of 1.960, the area up to the positive side is 97.5 percent, and the rest of the shaded region is in pnorm of minus 1.960. Let's add a comment here. Now, let's suppose that X is standard normal, which means it has a mean of 0 and a standard deviation of one. In that scenario, if we compute 1 minus pnorm of 0, we get 50 percent, and 1 minus pnorm of 1.960 is the rest of the shaded area, which matches the area on the negative side, same as line 42. The next thing is to compute our confidence interval with a two-tailed test. So we will pass the left-hand range and the right-hand range. If I take pnorm of 2.576 on the positive side minus pnorm of minus 2.576 on the negative side, then we get a 99 percent confidence interval. This cumulative range tells us that the confidence interval is 99 percent and we have a 0.01 significance level. So what's happening here? We are giving z-scores and it's returning the confidence interval. So this is for 99 percent. Now for 98%, it is pnorm of the z value, which is 2.326, minus pnorm of the same value with the minus sign. If I run it, you can see we got about 0.98. So this confirms that we have a 98 percent confidence interval and a 2% significance level.
Now for the famous and widely used confidence interval of 95 percent, we have pnorm of 1.960 on the positive side minus pnorm of 1.960 with the minus sign. This returns the 95 percent confidence interval. For 90 percent, it is a value of 1.645, with the same value negated for the negative range, and this returns the 90 percent confidence interval. For 80 percent, we have a value of 1.282. For 68%, it is only one standard deviation from the mean; running it gives 68%. And lastly, for 50 percent, we have a value of 0.674 on both sides, one positive and one negative. So this is how to feed z values to the cumulative function to get confidence intervals in R. I have picked one real-life example to show you the actual working of the pnorm function. We have this question: a healthy individual's muscle mass is well described by a normal distribution with a mean of 138 and a standard deviation of 15. If a person has a muscle mass of 165, calculate the probability, in the general population, of a person having a muscle mass of 165 or higher. I can use pnorm. We are going to calculate the probability in the whole population, and the whole population is 100 percent, so it is 1 minus pnorm. We want the probability for a muscle mass of 165, and for the population we have this mean of 138 with a standard deviation of 15. Now if I run it, you can see we got about 0.0359. I can assign it to, let's say, a, run it again, and then multiply a by 100. So only around 3.6% of the general population might have a muscle mass of 165 or higher. Now lastly, for the quantile section, we have this question to answer: what IQ score would one need to be in the 95th percentile if the standard deviation is 20? We want to find the IQ score such that the cumulative probability at that score is 95 percent.
These sorts of questions ask about the inverse cumulative distribution function, or you can say the quantile function. By using the qnorm function, we want to calculate the IQ score at the 95th percentile, so 0.95, with a mean of 100, and the standard deviation given to us is 20. If I run it, you can see we got an answer of about 132.9. So when the standard deviation is 20, a person needs to score 132.89 to be in the 95th percentile. As I said, the quantile and cumulative probability functions are inverses of each other, so you can backtrack by feeding this value to the cumulative function. Let's try it out. So pnorm with the value 132.8971, a mean of 100, and a standard deviation of 20. If I run it, you can see that we got 95 percent. So a person with an IQ score of 132.89, with the standard deviation of 20, is in the 95th percentile. And if a person has 110 with the same standard deviation, then that person would be in the 69th percentile of the whole population. Instead of typing this 132 value, you can call pnorm and pass the qnorm call as its first argument, which calculates the 132 value. Then you give it the mean of 100 and the standard deviation of 20, and it calculates the same 95th percentile. Let me add a comment here. So this section of data distribution and simulation has been completed. Now I will save this file as three, data distribution and simulation finished. As this section has been completed, it's time to start working with statistical tests.
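The cumulative and quantile calls walked through in this movie can be sketched as below. The variable names are my own choice, and the mean of 100 for the IQ example is the value used in the lecture.

```r
# Area between -z and +z under the standard normal curve
z_values <- c(0.674, 1, 1.645, 1.960, 2.576)
coverage <- pnorm(z_values) - pnorm(-z_values)
round(coverage, 3)   # roughly 50%, 68%, 90%, 95% and 99%

# Muscle mass: P(X >= 165) when X is Normal(mean = 138, sd = 15)
p_high <- 1 - pnorm(165, mean = 138, sd = 15)
p_high * 100         # as a percentage of the general population

# IQ: score at the 95th percentile with mean 100 and sd 20
iq_95 <- qnorm(0.95, mean = 100, sd = 20)

# qnorm and pnorm are inverses, so feeding the score back returns 0.95
back <- pnorm(iq_95, mean = 100, sd = 20)

# The same round trip written as a single nested call
nested <- pnorm(qnorm(0.95, mean = 100, sd = 20), mean = 100, sd = 20)

# Percentile of a person scoring 110
pct_110 <- pnorm(110, mean = 100, sd = 20)
```

Writing 1 - pnorm(...) is equivalent to calling pnorm with lower.tail = FALSE, which avoids the subtraction.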
22. 3 Single Proportional Test: Hi. The first statistical test I'm going to show you is the proportion test. As you already know from the previous movie, we develop hypotheses to check whether or not something is statistically significant. For example, we can check whether two proportions are equal or not. The proportion test lets you compare two proportions and helps you decide whether those proportions are significantly different. And of course, the decisions are made using the p-value and the confidence interval. If the p-value is small, the proportions are not equal: we reject our null hypothesis and go with the alternative one, which means there is a difference between the proportions. But we are not sure whether it is in the positive or the negative direction until we run a one-tailed test. And if the p-value is high, we cannot say that the proportions differ. Let us see this proportion test in R. I'm in my directory, and I have created this folder, section eight statistical tests, in which you can see this file, one single proportional test. Opening it, this will be the file. Let me add the basic information about statistical tests: the p-value, what is statistically significant, what it means if the p-value is low, and what it means if the p-value is high. And here is the information about the proportion test. Now, let's start working with the proportion test. You can always type a question mark followed by prop.test. If you run it, you will find information about this test, what kind of parameters you can pass, and how you can use them to define whether you want your test to be one-tailed or two-tailed. You can read more about this test in the documentation and the references listed there. I will start with a basic example. Suppose a student got accepted for a bachelor program at 14 universities out of the 20 universities he had applied to.
Now, our job as data analysts is to check whether the acceptance rate is significantly greater than 50 percent. We have two values here. First things first, let's create our table of proportions to get the overall idea of this question. I'm going to create this variable a, in which I build an array from these two values, 14 and 20, giving the array dimensions of 1 by 2. If I run it and show you a, this is the 14, 20 array. If I apply the function prop.table and pass it a, you can see that 14 has a proportion of 41 percent and 20 has a proportion of 58%. The reason I used an array is that prop.table is designed for arrays and tables. Now let's work with our proportion test. First, we need to state our hypotheses. Our null hypothesis would be that the acceptance rate is not significantly greater than 50 percent, and the alternative hypothesis would be that the acceptance rate is significantly greater than 50 percent. To test this, I'm going to call the function prop.test, passing 14 acceptances out of the 20 universities the student applied to. If I run it, you can see that this was our data, 14 out of 20. The null probability is 0.5, which means we are looking at a coin flip, a 50-50 chance. The chi-squared value is 2.45. The degrees of freedom is one, because we have only two values; degrees of freedom will always be the total number of values minus one. The p-value is very high: with a significance level of 5%, we are getting a p-value of 11 percent. This is our alternative hypothesis, true p is not equal to 0.5, and this is the 95 percent confidence interval for the true proportion. So we can be 95 percent confident that the true proportion lies between 45 percent and 87 percent. And this is the observed sample estimate of 70%. As we mentioned at the top of this file, we will make our decisions based on the p-value.
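A minimal sketch of the calls described so far, using the 14 acceptances out of 20 applications from the lecture. The object name res is my own; the components read from it are the standard fields of the object that prop.test returns.

```r
# Table of proportions for the two values, as built in the lecture
a <- array(c(14, 20), dim = c(1, 2))
prop.table(a)            # each value as a share of the total

# Two-sided single proportion test against the default null of p = 0.5
res <- prop.test(x = 14, n = 20)

res$statistic            # X-squared (with continuity correction)
res$parameter            # degrees of freedom
res$p.value              # compare this with the significance level
res$conf.int             # confidence interval for the true proportion
res$estimate             # observed sample proportion
```

Printing res itself shows the same report the lecture reads from the console.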
If we look at the p-value, it is 0.11. With a significance level of 5%, that is a very high p-value, which means we go with H-naught. We cannot say that the proportion is significantly greater than 50 percent, because the p-value is very high. One more thing to consider here: our question was whether the acceptance rate is significantly greater than 50 percent. We are not checking whether it's equal to 50 percent or not. So from the question, it should be a one-tailed test, not a two-tailed test. According to this two-tailed run, we cannot say that the acceptance rate is significantly greater than 50 percent, and we cannot reject our null hypothesis. Now, before going to the right type of test, which will be the one-tailed test, I want to show you something. In this case we have a proportion of only 14 out of 20. But as you increase the number of values, for example, if the student got accepted at 140 out of 200 universities he had applied to, then you can notice that the p-value will be very, very small. In that case, we would go with the alternative hypothesis, as the p-value is low. Now what if he got accepted at 1400 universities out of 2000? If I run it, you can see that the p-value is still very, very small, which means we reject our null hypothesis. And what if he got accepted at 14 thousand universities out of 20000? If I run it, you can still see the p-value is very, very small, so we reject the null hypothesis. So to answer the question of whether the acceptance rate is significantly greater than 50 percent, we need to go with a one-tailed test. So prop.test, with 14 acceptances out of the 20 universities he applied to. And now you have to give the alternative hypothesis, and we can say that the alternative is greater.
For the alternative hypothesis, you can write alternative in full or abbreviate it, for example alt equal to greater. With a two-sided test, you either leave out the alternative parameter entirely or set alternative equal to two.sided. See, we're getting the same value as this one, so this is the two-tailed test. Anyway, we want to go with the alternative hypothesis of greater. So H-naught would be that the acceptance rate is not significantly greater than 50 percent, and the alternative hypothesis would be that the acceptance rate is significantly greater than 50 percent. If I run this, you can see that I got a p-value of 0.058. So the p-value is just above the significance level of 5%. Treating it as borderline, we reject our null hypothesis and say that the acceptance rate is significantly greater than 50%. If you want to be precise, you can give it a custom confidence level. Instead of going with the default 95 percent, you can add one argument here, conf.level, and put your confidence level there. So a confidence level of 0.99. If I run it, you can see the p-value is still 0.058. In this scenario the significance level is 1%, so we cannot reject H-naught. In the earlier scenario, with the default confidence level of 95 percent, our significance level was 5%, and there we rejected H-naught and said that the acceptance rate is greater than 50 percent. With a confidence level of 90 percent, our significance level would be 10 percent. If we compare this to the p-value, we can clearly reject the null hypothesis, because the p-value is smaller than 10 percent. This is when your alternative hypothesis is that the proportion is significantly greater than 0.5. You can also confirm it from here. But if you want to check whether the acceptance rate is less, let's redefine our test. The alternative hypothesis would be that the acceptance rate is significantly less than 50 percent.
In that scenario, we call prop.test with our values, and this time the alternative is less, with the default confidence level of 95 percent. You can see that the p-value is very, very high, so in this case we cannot reject our null hypothesis. We also need to alter our null hypothesis: if the alternative is less, then the null hypothesis becomes the statement that the acceptance rate is equal to or greater than 50 percent. As the p-value is very high, we cannot say that the acceptance rate is significantly less than 50 percent, because the test found no evidence for the alternative hypothesis. Let's recap. In the first case, the p-value was very high, so we could not say that the acceptance rate is significantly greater than 50 percent, and we could not reject our H-naught. With 140 out of 200, the p-value is low: reject H-naught. With 1400 out of 2000, the p-value is low: reject H-naught. Same thing with 14000 out of 20000. Now for the one-tailed test, H-naught was that the acceptance rate is not significantly greater than 50 percent, and H1 was that the acceptance rate is significantly greater than 50 percent. We got a low p-value in this scenario, which is why we said the acceptance rate is significantly greater than 50 percent. When we ran prop.test with a 90 percent confidence level, the significance level was 10 percent, and the p-value was below it. That's why in that scenario we can reject our null hypothesis and state that the acceptance rate is significantly greater than 50 percent. And this was when we were using 95, so let's remove it. Now, in the case of less, our null hypothesis was that the acceptance rate is equal to or greater than 50 percent, and the alternative hypothesis was that the acceptance rate is significantly less than 50 percent. We passed the parameter less with a confidence level of 95. When we ran this, we got a very high p-value compared to the significance level. That's why we cannot reject our null hypothesis.
We cannot say that the acceptance rate is less than 50 percent. That's how the single proportion test works when you only have a single proportion to test. You can develop your hypotheses and check them accordingly.
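The one-tailed variants from this movie can be sketched as follows. The object names are my own, and the last call uses a scaled-up 140 out of 200 to illustrate the effect of sample size.

```r
# One-tailed tests for 14 acceptances out of 20 applications
greater <- prop.test(14, 20, alternative = "greater")   # H1: p > 0.5
less    <- prop.test(14, 20, alternative = "less")      # H1: p < 0.5

# conf.level changes the reported confidence interval, not the p-value,
# so the decision depends on which significance level you compare against
greater_90 <- prop.test(14, 20, alternative = "greater", conf.level = 0.90)
greater_99 <- prop.test(14, 20, alternative = "greater", conf.level = 0.99)

c(greater = greater$p.value, less = less$p.value)

# Same proportion with ten times the data: the p-value shrinks sharply
big <- prop.test(140, 200)
big$p.value
```

The p-value for greater is a little under 0.06, so it is below a 10% significance level but not below 5% or 1%, which is why the choice of level matters here.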
23. 4 Double Proportion: Now suppose we are not analyzing one student; we are analyzing the acceptance rates of two students. In that scenario, we can still use the proportion test. I have saved the single proportion file as double proportional test. Let me remove all of this. I'm going to create this variable, applied universities: we have two students who both applied to 20 universities, so 20 and 20. And the acceptances for the two students are 14 and 11. You can think of the applications as trials and the acceptances as successes, like flipping a coin. So I call prop.test, passing the acceptances first; the next argument is the applied universities, and the confidence level is the default 95 percent. If I run it, you can see that we got this chi-squared value, the degrees of freedom, and a p-value that is very large. The confidence interval for the difference between the two students' proportions runs from minus 0.19 to 0.49. And the sample estimates are 70 percent for proportion 1 and 55 percent for proportion 2. First we need to develop our H-naught and H1. We can say H-naught is that both students have an equal acceptance rate, and H1 is that they do not. As the p-value is very high, 0.51, compared to our significance level of 0.05, we cannot reject H-naught, and we cannot say that the two students have different acceptance rates. You can see from the alternative hypothesis that this is a two-sided test. Now suppose we are analyzing more than two students. Let's say the first vector is 20, 20, 30, 25, 40, and the acceptances are 14, 17, 18, 19. If I run it, with 95 here, you can see that the p-value is very small, which means we can reject our null hypothesis. We can confidently say that the students do not have equal acceptance rates.
In this scenario we're running a two-sided test, and these are the sample estimates. As we have five values, five data points, the degrees of freedom is five minus 1, which is four. When you have more than two proportions to compare, your test will be two-sided, because a one-sided test only makes sense when you compare proportions one pair at a time. That's how you compare different proportions.
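A minimal sketch of both calls from this movie. The two-student numbers are the ones from the lecture. For the five-student case the transcript gives only four acceptance counts, so the fifth value below is a hypothetical placeholder and the resulting p-value will not match the lecture's output.

```r
# Two students: acceptances (successes) out of applications (trials)
accepted <- c(14, 11)
applied  <- c(20, 20)
two <- prop.test(x = accepted, n = applied)

two$estimate    # sample proportions: 0.70 and 0.55
two$conf.int    # confidence interval for the difference in proportions
two$p.value     # high p-value: no evidence the rates differ

# Five students: tests whether all five proportions are equal
applied5  <- c(20, 20, 30, 25, 40)
accepted5 <- c(14, 17, 18, 19, 10)   # last count (10) is hypothetical
five <- prop.test(x = accepted5, n = applied5)

five$parameter  # degrees of freedom = number of groups - 1
five$p.value
```

With more than two groups, prop.test only supports the two-sided alternative; to see which specific pairs differ you could look at pairwise.prop.test, which is not covered in this movie.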
24. 5 T Test Overview: Continuing with statistical tests and their applications, one of the most used statistical tests is the t-test. T-tests are a type of hypothesis test that allows you to compare means. The t-test tells you how significant the differences between groups are. In other words, it lets you know whether those differences, measured in means, could have happened by chance. A large t-score tells you that the groups are different, and a small t-score tells you that the groups are similar. For example, if you get a t-score of three, then that means your groups are three times as different from each other as they are within each other. And low p-values are good: they indicate your result did not occur by chance. There are three main types of t-test. The default t-test, which is also called the one-sample t-test, tests the mean of a single group against a known mean. For example, you can test the mean of your sample against a population mean. In the default t-test, we compare our mean to 0. If we are comparing it to 0, then that means our data should be normally distributed. The second type is the two-sample t-test, which is also called the independent samples t-test, which compares the means of two groups. The last type is the paired sample t-test, which compares means from the same group at different times, say one year apart, or blood pressure before and after exercise. Now let's look at these types piece by piece. As I said earlier, the one-sample t-test tests the mean of a single group against a known mean. For example, compare whether the mean weight of mice differs from 200 mg. You have the mean weight of 200 mg from a previous study, and now you want to check whether the new mean weight of mice differs from that 200 mg value. Or you can use a one-sample t-test in an experiment where you have control and treatment conditions.
If you express your data as percent of control, you can test whether the average value of the treatment condition differs significantly from 100. Or suppose you are interested in determining whether a car assembly line produces cars that weigh 1630 kilograms. To test this hypothesis, you could collect a sample of cars from the assembly line, measure their weights, and compare the sample with a value of 1630 kilograms using a one-sample t-test. So the one-sample t-test compares one sample mean to a null hypothesis value. But remember, the one-sample t-test can be used only when you have normally distributed data. Now, the two-sample t-test, which is also called the independent samples t-test, compares the means of two groups. For example, you can run a test to check whether sales in the east and west regions are different. Or you can test whether the average heart rates of males and females are different. The t-test answers these types of questions and tells you whether such differences could have occurred by random chance. The two-sample t-test is only for two groups. If you have more than two groups, you can use analysis of variance (ANOVA) or Tukey HSD methods to compare group means. The default null hypothesis for a two-sample t-test is that the two group means are equal. The paired sample t-test compares means from the same group at different times. It simply calculates the difference between paired observations, say the growth of a company in 2016 compared with its growth one year later. Or let's assume a student was learning by himself, and after a teacher's intervention we need to measure whether there is any difference in the student's progress. In this type of before-and-after scenario, we can use the paired sample t-test to analyze the before and after data. But remember, the before and after test scores for this student should be for the same subject.
If the scores in each row are for different subjects, it does not make any sense to calculate the difference. In that case, you need to use the two-sample t-test. So the paired t-test simply calculates the difference between paired observations and then performs a one-sample t-test on the differences. As a second example, consider that you want to measure the effectiveness of a product on some numerical scale. Now in the next movies, I will show you how to work with these t-tests.
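The three t-test types from this overview can be sketched with R's t.test function. The scores below are invented purely for illustration; the point of the sketch is that a paired t-test gives the same result as a one-sample t-test on the differences.

```r
# Hypothetical test scores for the same eight students, before and after
before <- c(62, 70, 55, 68, 74, 59, 66, 71)
after  <- c(68, 72, 61, 70, 80, 60, 71, 75)

# One-sample t-test: is the mean of 'before' different from a known value?
one_sample <- t.test(before, mu = 60)

# Two-sample (independent samples) t-test: treats the two vectors as
# unrelated groups and compares their means
two_sample <- t.test(after, before)

# Paired t-test: uses the row-by-row pairing of the observations
paired <- t.test(after, before, paired = TRUE)

# Equivalent one-sample t-test on the paired differences
on_diff <- t.test(after - before, mu = 0)

c(paired = paired$p.value, on_diff = on_diff$p.value)   # identical
```

Because the pairing removes student-to-student variation, the paired test is usually more sensitive here than the two-sample test on the same numbers.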
25. 6 One Sample T Test Default T Test: I have created this file, one-sample t-test, which is also called the default t-test. Before working with this file, if you want more information, you can run question mark t.test, and R will give you basic information about this t-test, which is also called Student's t-test. Anyway, let's work with a small dataset. Let's say I have this data: 0.9, minus 0.8, 1.3, minus 0.3, 1.7. To apply the default t-test to this data, which compares its mean to 0, I can simply pass the data. If I run it, you can see we got a p-value of 0.3, which is very high if we consider a significance level of 5%, and the t value is not very large. There is no rule of thumb for what value of t counts as large, but if you have a very low p-value, then you will automatically have a large t value. And here is the mean of x. So in this t-test, we compare the mean to 0. The one-sample t-test compares the sample mean to a null hypothesis value, so in this scenario our H-naught is that the true mean is equal to 0, and H1 is that the true mean is not equal to 0. As the p-value is high, more than the significance level, we cannot say that the true mean is not equal to 0. Now let's run this one-sample t-test on our iris data. So, loading the datasets library and the iris data. If I find the mean of, let's say, the sepal width column, it's 3.05. If I plot a histogram, you can see that it looks like a normal distribution if you just look at the shape. However, if you look at the mean, it is not centred at 0 the way a standard normal distribution is. Suppose you don't have this x-axis; you just have this curve. We can use a one-sample t-test to check whether the mean is 0. So t.test on the iris data, column sepal width. If I run it, you can see that I got a p-value which is very small.
So this shows that, from a statistical point of view, the true mean is not equal to 0. This time the p-value is very small, less than the significance level, so we can say that the true mean is not equal to 0. This was the two-sided t-test. If you want a one-tailed one-sample t-test, you can pass the additional argument alternative equal to greater. If I run it, I'm still getting a very small p-value, which also confirms that H-naught should be rejected. Same thing: the p-value is very small, so we can say that the true mean is greater than 0, as the sample mean is about 3. Now if I copy all of this and pass one additional argument, that the population mean mu is equal to, let's say, three, and run it, you can see the p-value is almost equal to the significance level, just above it. Treating it as borderline, we reject the null hypothesis and say that the true mean is greater than 3. This mu is the population mean. Now, same thing, but instead of feeding it a mu of three, I make mu four. If I run it, you can see that the p-value is very large, which means we cannot reject H-naught. We cannot say that the true mean is greater than 4, because 3.05, which is our actual sample mean, is not greater than that population mean. So that's how you work with the one-sample t-test in R.
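The one-sample calls from this movie can be sketched as below. The small vector x uses the five values as best reconstructed from the transcript, which is an assumption; iris ships with R in the datasets package.

```r
# Small example: compare the mean of x to the default mu = 0
x <- c(0.9, -0.8, 1.3, -0.3, 1.7)   # values reconstructed from the transcript
small <- t.test(x)
small$p.value                        # high, so H0 (true mean = 0) is not rejected

# iris example on the sepal width column
data(iris)
sw <- iris$Sepal.Width
mean(sw)

two_sided <- t.test(sw)                                      # H1: mean != 0
greater_0 <- t.test(sw, alternative = "greater")             # H1: mean > 0
greater_3 <- t.test(sw, mu = 3, alternative = "greater")     # H1: mean > 3
greater_4 <- t.test(sw, mu = 4, alternative = "greater")     # H1: mean > 4

c(mu_0 = greater_0$p.value,
  mu_3 = greater_3$p.value,
  mu_4 = greater_4$p.value)
```

The p-value for mu = 3 lands slightly above 0.05, so whether you reject H-naught there depends on the significance level you committed to beforehand.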
26. 7 Two sample T Test Independent sample T test: Now we are working with the two-sample t-test, which is also called the independent samples t-test. Here is the basic definition of this t-test and the default hypothesis testing rule of thumb. In this movie I will be working with this heart dataset, which you can find at this kaggle.com link. I have already put this dataset in the directory of Section eight statistical tests. Now let's load this data, heart.csv. Don't forget to set your working directory by going to Session, Set Working Directory, To Source File Location, where you have both of these files: this R file right here and heart.csv. So let's read in this file. Let me show you the colnames of this data. We have these column names, and you don't see the anomaly of the "ï.." prefix on the first column name, because we have included the file encoding. Out of all these columns, I'm interested in only two. The first is the sex column, and the other is the maximum heart rate achieved column, which is this one, thalach, because we want to check whether or not the mean maximum heart rate is the same for both sexes, males and females. So I will create my own dataset. Heart sub will be the name of our dataset, built with data.frame from the sex column of heart and the heart rate achieved column of heart. Now, running it, let's give it names. But first let's see the current names. It has the name thalach. So we can set the column names to this vector of sex and max heart rate. Let me run it. If I show you heart sub, you can see that we have created this new data frame out of the CSV file, and it has only two columns, sex and maximum heart rate. Now I can attach it and use these columns directly instead of using the dollar notation. If I show you the type of heart sub, you can see it's a list. If we want some quick insights, we can call the summary function, not on heart but on heart sub.
So this is the summary for sex and maximum heart rate: a mean of 0.68 for sex and a mean of 149.6 for the maximum heart rate. Now let me draw this data using a histogram, just the maximum heart rate column, not sex. Let's give it some color. Okay. So it looks roughly like a bell curve, but if we look along the x-axis, it's clear from the histogram that this is not perfectly normal data, right? And if I draw a boxplot of this maximum heart rate against the sex column, you can see that for females, coded 0, it has a slightly higher median compared to males. We also have some outliers in the female group as well as in the male group. The groups seem different in the boxplot, but let's check with a t-test whether the maximum heart rate is actually different for males and females or whether they are equal. So, running this t-test of maximum heart rate against the sex column. If I run it, this is the two-tailed test, and the p-value is very high. Our hypotheses are: for the null hypothesis, the maximum heart rate means are equal for both genders, and for H1, there is a difference between the maximum heart rate means for both genders. As the p-value is very high for this two-tailed test, we cannot reject our H naught, so we cannot say that the mean heart rate differs between the genders. Now, if I run this with different parameters, as a one-tailed test with the alternative hypothesis greater, you can see the p-value is still very high, so we have the same outcome. You can also check it for less, which is still one-tailed, and if I run it, you can see the p-value is again very high. So if we look at these three tests together: when we ran our greater test, it did not reject our H naught, and when we ran our less one-tailed test, the result was the same.
So this means there is no way we can reject our null hypothesis that the maximum heart rate means are equal for both genders. We cannot reject it, and that is the way we write our result when the null hypothesis has not been shown to be false. So I can conclude that there is no statistically significant difference in heart rate between the genders. I have picked one more dataset for you to explain this two-sample t-test. This is our Sales.csv data, which has two columns, sales and location. We have the locations West office and East office. Now, example 2: we are going to compare whether the mean sales from the two office locations are the same or whether they differ. Setting the working directory. Now I will have this variable sales, which will read in that csv file, and I'm showing you a couple of rows. Okay, so there is no problem with the first column. The next thing is the summary for sales. So we have 38 sales coming from the East office and 34 sales coming from the West office, and the sales median is 160.9, with a mean of 157.4. Our inference for this data is to test whether the group means are different for the two locations. So our H naught would be that East office mean sales minus West office mean sales is equal to 0, and the alternative hypothesis, Ha or H1, would be that mean sales in the two locations are not equal. So t.test, passing the sales column against the location column, and I can mention my data name here, which will be sales. If I run it and you look at the p-value, it's very high, which means we cannot reject our null hypothesis. Now, if I run this test with a confidence level of, let's say, 0.9, you can still see that the p-value is very high, and that's why we cannot reject our null hypothesis. You can also see the East office mean and West office mean here.
And we are getting the 90 percent confidence interval: the difference between the East office mean and the West office mean will be between minus 2.2 and 15.4. Here is the t value, which is not very large, as the p-value is very large. And this is the degrees of freedom. So using the two-sample independent t-test, you can tell anyone whether the means of two groups differ.
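A minimal sketch of the sales comparison, assuming simulated data: the column names Sales and Location, the group means, and the seed are hypothetical stand-ins for the Sales.csv file, which is not reproduced here. Note that t.test runs the Welch version by default, which does not assume equal variances.

```r
# Hypothetical stand-in for Sales.csv: 38 East sales and 34 West sales
set.seed(1)
sales <- data.frame(
  Sales = c(rnorm(38, mean = 160, sd = 30), rnorm(34, mean = 155, sd = 30)),
  Location = rep(c("East", "West"), times = c(38, 34))
)

# Two-sided test of H0: mean(East) - mean(West) = 0
two_sided <- t.test(Sales ~ Location, data = sales)

# Same test, but reporting a 90 percent confidence interval
ci_90 <- t.test(Sales ~ Location, data = sales, conf.level = 0.90)

# One-tailed versions of the same comparison
greater <- t.test(Sales ~ Location, data = sales, alternative = "greater")
less <- t.test(Sales ~ Location, data = sales, alternative = "less")

print(two_sided$estimate)  # group means
print(ci_90$conf.int)      # 90 percent interval for the difference in means
print(c(two_sided = two_sided$p.value,
        greater = greater$p.value,
        less = less$p.value))
```

Changing conf.level only changes the width of the interval; the two-sided p-value stays the same, which matches what the movie shows.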
27. 8 Paired T Test: Hi, welcome back. Now the last type of t-test, which is the paired t-test. For this, I have this scenario for you. Consider 20 patients randomly drawn from one area. Their resting systolic blood pressure is something like this. Now we need to create hypotheses to test whether there is a difference between the before and after readings. So our null hypothesis will be that there is no difference, and H1 would be that there is a difference. So first let's run these, and this is a paired test. Before doing anything, let's find the mean of before and the mean of after. So there is very little difference between them, which we can observe. Let's see it in the test. So t.test, before, comma, after. And as this test is paired, I have to pass this parameter: paired is equal to TRUE. If I run it, we get a p-value, and the p-value is very high, which means we cannot reject our H naught. Now, if I run this same test one-tailed, the alternative would be greater or less. Let's try it with greater. You can still see a high p-value, and this is the mean difference between the samples. As the p-value is high, we cannot reject the null hypothesis. So this is a one-tailed test. If I check for less, whether one is less, and run it, you can still see the p-value is very high. So we have found no evidence of a difference between these two readings, and we can say that we cannot reject the null hypothesis, as the p-value is very high. Now for the next example, in my directory I have this file, student before after teaching intervention. We have worked with this file many times. So let's say I have this STD data, which is equal to read.csv, mentioning the file name. If I run it, let's see the head of this file. Let's see the summary of our data. So we got a mean of 18.4 for the before column and 20.45 for the after column. So there is a very small difference between the before and after columns. So if I plot these.
It also suggests that before and after are really close, but are they really close? Let's confirm it by using the paired t-test. First, we need to develop our hypotheses. This is our data: the marks for a group of students before and after a teaching intervention. And our research question is: is there a difference in marks following the teaching intervention? So our null hypothesis would be that there is no difference in mean pre and post marks, and the alternative hypothesis would be that there is a difference in mean pre and post marks. To test this, we need our paired t-test. So I'm going to pass these columns, before and after, and this test will be paired, so paired is equal to TRUE. If I run it, we get the paired t-test output: we get a negative t value, and this is our p-value. Remember, I told you some time earlier that if the p-value is very small, then the t value must be large in magnitude. But this t value is in the negative direction. We are also getting a 95 percent confidence interval in the negative direction, and the same with the mean of the differences. It's because I have put the before column first and the after column in the second place. If I repeat this statement and swap the places of these two, and now run it, you can notice that the t value is in the positive direction. The p-value is still the same. We are also getting the 95 percent confidence interval for the mean difference, and this is the mean of the differences. So as far as the p-value is concerned, it does not matter which column you put first in the paired t-test. But it will definitely affect the sign of your t value, and that's why you should try to put the after column first. Okay? So if I run it one more time, the p-value is very, very small, lower than the significance level. That's why we can reject our H naught, and we can confidently say there is a difference in mean pre and post marks. Now, if I run it as a one-tailed t-test, and I put alternative equal to less.
If I run it. Now you can see that we got a very high p-value, so we cannot reject H naught. If I change this one-tailed test to alternative greater, testing whether the after value is greater, and run it, you can see that we got a very low p-value, which means we can reject H naught and say there is a difference. Let's put it here. And this paired t-test confirms that there is a difference in mean pre and post marks. Looking at the two-tailed test on line 47, the p-value is very low, 0.004. So there is only a very small probability of a result at least this extreme occurring by chance under the null hypothesis of no difference. That's why the null hypothesis is rejected. So I can conclude, by looking at the t value and, more importantly, the p-value, that there is strong evidence that the teaching intervention improves students' marks. In this data, it improved student marks on average by approximately 2.05. And of course, if we were to take other samples of marks, we would get a mean paired difference in marks different from this 2.05. This is why it is important to look at the 95 percent confidence interval. The 95 percent confidence interval tells us that if we were to repeat this experiment 100 times, about 95 of the intervals we compute would contain the true value of the difference. In our case, the 95 percent confidence interval is from 0.7 to 3.3. This confirms that although the difference in marks is statistically significant, it is actually quite small. So you will need to consider whether the difference in marks is practically important, not just statistically significant.
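The point about column order can be checked with a small sketch. The before and after vectors below are simulated (the seed, sample size, and means are assumptions), so the numbers will not match the teaching intervention file from the movie.

```r
# Hypothetical marks for 20 students before and after an intervention
set.seed(7)
before <- round(rnorm(20, mean = 18, sd = 3))
after <- before + round(rnorm(20, mean = 2, sd = 2))

# Paired two-sided tests with the columns in either order
after_first <- t.test(after, before, paired = TRUE)
before_first <- t.test(before, after, paired = TRUE)

# One-tailed test: is the mean of (after - before) greater than 0?
one_tailed <- t.test(after, before, paired = TRUE, alternative = "greater")

# Swapping the order flips the sign of t and of the interval, not the p-value
print(c(t_after_first = unname(after_first$statistic),
        t_before_first = unname(before_first$statistic)))
print(c(p_after_first = after_first$p.value,
        p_before_first = before_first$p.value))
print(after_first$conf.int)
```

Putting the after column first makes the reported mean difference read directly as the average gain.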
28. 9 F Test ANOVA Tukey HSD: Hi, we have seen how to compare the means of different groups using t-tests. Now, there is another very useful test for examining the variation in your data. That is the F-test. The F statistic is simply a ratio of two variances. We already know from the chapter on data distributions what variance is. Variance is the square of the standard deviation and a measure of dispersion. Or you can say it tells us how far the data are scattered from the mean. Large values represent great dispersion and low variance values represent low dispersion. And we already know that if we sum all the squared deviations from the mean, we can call that the total sum of squares. Let's say we have three sets of data, blue, green, and brown. All of them are normally distributed, but these bell curves are not very wide. And even if they were wide, their means are very close to each other. For example, this blue one has a mean of six, this green one has a mean of seven, and the brown one has a mean of eight. So the variation between the means of these three datasets is not very large; the means are close. The variation between the group means is small compared to the spread within each group, and that's why we get a low F value, as the F value is simply a ratio of two variances. So the low F value graph shows a case where the group means are close together, meaning low variability between them relative to the variability within each group. And here is another example. We have three sets of data. The first one has a mean of 6, the second 11, and the third, brown, 16. Now the means are far apart, so the variance between the group means is high, and that means you will have a high F value. So the high F value graph shows a case where the variability of the group means is large relative to the within-group variability. So in order to reject the null hypothesis that the group means are equal, we need a high F value.
And of course, if we are prepared to assume that the data is normally distributed or not far from it, or the dataset is large enough, then the F statistic under the null hypothesis that the group means are equal has a pre-determined distribution called the F distribution. F statistics are always positive, and the test is in effect always two-sided, because it has no direction: it only asks whether the means differ. To use the F-test to determine whether group means are equal, it's just a matter of including the correct variances in the ratio. In one-way ANOVA, the F statistic is this ratio: variation between sample means divided by variation within the samples. The best way to understand this ratio is to walk through a one-way ANOVA example. ANOVA uses the F-test to determine whether the variability between group means is larger than the variability of the observations within the groups. If the ratio is sufficiently large, you can conclude that not all the means are equal. This brings us back to why we analyze variation to make judgments about means, and that's why you use analysis of variance to test the means. So a large F value means that not all means are equal. The F-test in ANOVA just tells us whether at least one pair of means differs. But which pair? You can use t statistics to tell which pair is different. And this brings us to Tukey HSD, Tukey's Honestly Significant Difference test. This test is named after the American mathematician John Tukey, and it allows for multiple testing by considering the distribution of the maximum t statistic across all categories. So, considering the distribution of the maximum t statistic across all categories, we take the pooled variance from ANOVA, also called the mean squared error or MSE, and then we compute our t statistic for each pair of means. So if I summarize all of my discussion, I would say that we use the F statistic to identify whether one or more categories from a set of categories have a different mean, and the F-test only focuses on the variances.
A low F value means low variance between the group means relative to the variance within the groups, and a high F value means high variance between them. For ANOVA, the results of an F-test are presented as an ANOVA table. And to identify which pairs have different means, we use Tukey's HSD. The F-test just tells us whether the categories are different. We see the results in the ANOVA table, and then by using the Tukey HSD method, we identify which pairs have different means. In R, making judgments by using these methods is more than easy, and I will show you that in the next movie.
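To make the ratio concrete, here is a small sketch that computes the F statistic by hand as the between-group mean square divided by the within-group mean square. The helper f_stat and the tiny blue, green, and brown samples are hypothetical; they only borrow the group means 6, 7, 8 and 6, 11, 16 from the slides.

```r
# Hypothetical helper: one-way ANOVA F statistic computed from its definition
f_stat <- function(values, group) {
  group <- factor(group)
  k <- nlevels(group)          # number of groups
  n <- length(values)          # total number of observations
  grand_mean <- mean(values)
  group_means <- tapply(values, group, mean)
  group_sizes <- tapply(values, group, length)
  # Variation between the group means
  ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
  # Variation of the observations around their own group mean
  ss_within <- sum((values - ave(values, group))^2)
  (ss_between / (k - 1)) / (ss_within / (n - k))
}

group <- rep(c("blue", "green", "brown"), each = 3)

# Means 6, 7, 8: close together
close_means <- c(5, 6, 7,  6, 7, 8,  7, 8, 9)
# Means 6, 11, 16: far apart, same spread within each group
far_means <- c(5, 6, 7,  10, 11, 12,  15, 16, 17)

f_close <- f_stat(close_means, group)
f_far <- f_stat(far_means, group)
print(c(f_close = f_close, f_far = f_far))
```

With these made-up numbers the within-group mean square is 1 in both cases, so F is 3 when the means are close and 75 when they are far apart: only the spread of the group means changed.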
29. 10 Performing F Test ANOVA Tukey HSD: Hi. We have this file, six, F-test ANOVA Tukey HSD. We are comparing the means with the F-test, ANOVA and Tukey's HSD. Here is the definition of the p-value: what is statistically significant, what a lower p-value indicates, and what a high p-value means. Now right here, I will start with loading the dataset. The dataset which I will be using is the iris dataset. If I show you the head of this data, it has four numerical columns and one categorical column. And if I show you the levels of iris Species, we have three groups: setosa, versicolor and virginica. So we are interested in finding the variation between these three groups using the F-test, ANOVA and Tukey's HSD. First of all, let me plot this data by using the dot chart. Let's say I will go with the column Sepal.Length, and for the groups argument I'll use the Species column. For xlab, sepal length, and for the points I will use pch 16. Running it. So this is the dot chart which we have received. This is the first group for sepal length. I'm trying to find the variation in sepal length according to these three groups. If you just look at this chart, you cannot tell how much variability there is between the groups or within each group. It seems like we have more variability in virginica compared to setosa, but we cannot show that statistically until we use the F-test, ANOVA, or Tukey's HSD. Now let me show you all of this by using a boxplot, Sepal.Length against Species, and data will be iris. Running it. So you can see that the centers are different. If you just look at this boxplot, this one is the lowest, this one is in the middle between virginica and setosa, and this one is the highest. So let's use the F-test to really see whether these group means are statistically different or not. I'm going to develop the hypotheses. Our H naught would be that all means are equal for all categories, and our H1 would be that at least one pair of means is not equal.
We will perform this test using the normal approximation, and this may not be accurate if the data sample is small. Remember, in the previous movie, I told you that you should have a large sample size if you want to perform one-way ANOVA. If I show you the basic documentation of the one-way ANOVA test, test for equal means in a one-way layout, here is the formula, and the basic description you can read from here. Now, let's conduct one-way ANOVA: oneway.test, mentioning my column Sepal.Length against the Species column, our data will be iris, and var.equal is equal to TRUE. The documentation tells us that var.equal is a logical variable indicating whether to treat the variances in the samples as equal. If TRUE, then a simple F-test for the equality of means in a one-way analysis of variance is performed. If FALSE, an approximate method of Welch is used, which generalizes the commonly known two-sample Welch test to the case of arbitrarily many samples. So we are performing it on more than two groups, and to use classical one-way ANOVA, you need to put TRUE. Note that treating the variances in the samples as equal is an assumption of the test; the null hypothesis is still that the group means are equal. So now if I run this, you can see the output I got. First, I have the F value, which is very, very high, which indicates high variance between the group means relative to the variance within the groups. The numerator degrees of freedom is two, because we have three categories, setosa, versicolor and virginica, and three minus one is two. The denom df is the denominator degrees of freedom, the number of sepal length observations minus the number of groups. And we got a very low p-value. Having a low p-value means we can reject our null hypothesis, so at least one pair of means is not equal. Now, let's compute the t statistic for each pair of categories to find which pair is different. So I'm going to create this variable fit, which will be equal to aov. If you want to look up this aov function, the information is here.
So, fit an analysis of variance model: aov, mentioning these columns, Sepal.Length against Species, and data will be our iris. Putting it here. Now running it. If I show you the fit, it's giving me the residual standard error estimate. Now, if you want to see the summary of this fit, you can call the summary function and it will give you the sums of squares. The sum of squares measures the spread: it adds up all the squared deviations. The mean square for the residuals is the pooled variance, MSE, the mean squared error. And if you divide the sum of squares for Species by its two degrees of freedom, then you will get a mean square of 31.606. Here is the large F value, and you are still getting the low p-value. We are getting three stars, which means this is highly, highly significant. So for 5%, we have one star, for 0.01 we have two stars, and for 0.001 we have three stars. So this p-value is highly, highly significant. Now if you want to see the coefficients, you can write fit dollar coefficients and it will give you the intercept and the group effects. You can also find the residuals. These are all the residuals. Now if you want to find the mean squared error, let's say I have this variable MSE. So summary of fit, double bracket one, then two comma three. If I show you my mean squared error, this is it, and it is very low. If I scroll up, this was the sum of squares; if you divide it by its degrees of freedom, you get the mean square. And this is the mean squared error, which is actually the pooled variance. You can also see it right here: 0.265 mean squared error. What I'm saying here is: go to the summary of fit; one means the first element, which is the ANOVA table, and inside that table go to two comma three. So the second row, which is the residuals row, and one, two, three, the third column. That's why two comma three. If I remove this two comma three and show you MSE, you can see all the information of this summary, right? Now, as the sample size is large for this iris data, we can perform post-hoc comparisons.
So I'm going to copy this fit from here and paste it right here. And if I run it, now I can perform Tukey's HSD testing on this fit, because we had this iris data, which has a large sample size, so we can rely on the F-distribution approximation. That's why we are performing this post-hoc comparison. This Tukey HSD testing will compute a t statistic for each pair and account for the distribution of the maximum t value across all pairs of categories. Now let me add a comment for you here. So after running Tukey's HSD, we got the lower and upper bounds, and you can also see the difference between each pair of groups. Between versicolor and setosa, we have a difference of 0.93; between virginica and setosa, we have a difference of 1.582; and virginica and versicolor have a smaller difference of 0.652. So we can say that virginica and setosa have the largest difference in group means. Now if you want to plot this Tukey HSD, you can call the plot function on TukeyHSD, passing the fit parameter here. If I run it, this is the plot which we got. So versicolor and setosa have a mid-range difference, virginica and versicolor have the smallest difference, and the virginica and setosa difference is the highest. So Tukey's HSD used t statistics and returned the actual differences between these groups, and we can confidently conclude that virginica and setosa have the largest difference.
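The whole sequence from this movie can be sketched on the built-in iris data as follows. The variable names ow, fit, mse, and tukey are choices made here for illustration; the function calls are the standard ones from base R's stats package.

```r
# One-way ANOVA F-test, assuming equal variances across the three species
ow <- oneway.test(Sepal.Length ~ Species, data = iris, var.equal = TRUE)
print(ow)

# The same model fitted with aov, which gives the full ANOVA table
fit <- aov(Sepal.Length ~ Species, data = iris)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Mean squared error (pooled variance): residuals row, "Mean Sq" column
mse <- anova_table[2, 3]
print(mse)

# Post-hoc pairwise comparisons with Tukey's Honestly Significant Difference
tukey <- TukeyHSD(fit)
print(tukey)

# Optional: plot the confidence intervals for each pairwise difference
# plot(tukey)
```

An interval in the Tukey output that does not contain 0 marks a pair of species whose mean sepal lengths differ at the chosen family-wise confidence level.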
30. 11 Chi Square One Sample Goodness of fit Test: The next statistical test is the chi-square test. The chi-square test checks whether two samples of categorical data come from the same population, or whether a sample follows a given population. We use the chi-square distribution in this test analysis, and the value of the chi-square statistic is denoted by X squared. The chi-square test has two types. Both use the chi-square statistic and distribution, for different purposes. The first type, which is called the one-sample chi-square and also the goodness of fit test, determines whether our sample data matches a population. This goodness of fit test fits one categorical variable to a distribution. The other type, the chi-square test for independence, compares two variables in a contingency table to see if they are related. In general, it tests whether the distributions of categorical variables differ from each other. In other words, the chi-square test for independence compares two sets of data to see if there is a relationship. Now, the values the chi-square test returns. A low value for chi-square means the observed values agree closely with the expected values. In theory, if your observed and expected values are equal, meaning no difference, then chi-square would be 0. An event like that, where your observed and expected values are exactly equal, is very unlikely to happen in real life. A large chi-square test statistic means that the data does not fit very well. In other words, the observed data do not match what was expected. Unfortunately, there is no chi-square test statistic threshold by which, just by looking, you can decide there is or there is not a relationship. However, we can look at the p-value. If the p-value is less than the significance level, we reject the null hypothesis, and otherwise we cannot reject it. Let's see the first type of chi-square in this movie. So I'm on this file, seven, chi-square one-sample goodness of fit test. And here is the basic definition of the goodness of fit test.
Here is the default description, the p-value and the significance level. Let's say in our first question, a fitness trainer has determined the following counts of calories for one of their clients. Now use the chi-square test to obtain a p-value and test the following hypothesis. Our null hypothesis would be that calories are equally spread during the week, and the alternative hypothesis would be that calories are not equally spread. So first things first. Let's get this data into a vector. I'm going to create a vector, calories per day, which will be equal to these calories which we got. Now, running it, let's also give it names. So names of calories per day is a vector of the days. Now, seeing calories per day, this is our data. And I can use a bar plot to plot these calories. So these are the calories during the week. Now I'm going to apply the chi-square goodness of fit test on these calories, so chisq.test, simply passing calories per day. If I run it, you can notice that we got the X-squared value, the chi-square value. This very large chi-square value means the data does not fit the equal spread during the week very well. As I mentioned a couple of moments ago, we don't have any chi-square threshold which could tell us whether a value within some range represents a good or a poor fit. That's why we still have to look at the p-value. Looking at the p-value, it is very small. We can say that the counts are not all equal, so we can reject our H naught. Now, here is another example for you. Cricket batsmen are either left-handed or right-handed. A count of 74 Australian cricketers found that 29 were left-handed and 45 were right-handed. According to the Australian cricket board, 45 percent should be left-handed and 55 percent should be right-handed. Now our job is to use chi-square to check whether the data is consistent with this hypothesis. So this is our hypothesis. I'm going to create a variable, a vector, batsmen count.
So we have 29 batsmen who are left-handed and 45 who are right-handed. Now giving it names. Let's plot this. This is the current situation of the Australian cricket team, and we have these expected proportions from the Australian cricket board. So I'm going to create a variable e, which will have these expected proportions: 45 percent, so 0.45, and 55 percent, 0.55. Running it. Now I'm going to use chisq.test, passing this batsmen count, and the expected probability, p, will be equal to this e. If I run it, we get a very high p-value, which means we cannot reject our null hypothesis. That's why we cannot say that the proportions don't match. So that's how you do the goodness of fit test.
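The batsmen example uses only numbers stated in the movie, so it can be sketched directly. The variable names batsmen_count, e, and gof are choices made here; the calories example is left out because its daily counts are not given in the transcript.

```r
# Observed counts from the movie: 29 left-handed and 45 right-handed batsmen
batsmen_count <- c(left = 29, right = 45)

# Expected proportions under H0: 45 percent left-handed, 55 percent right-handed
e <- c(0.45, 0.55)

# Goodness of fit test; without p, chisq.test assumes equal proportions
gof <- chisq.test(batsmen_count, p = e)
print(gof)

# Expected counts under H0: the total of 74 split 45/55
print(gof$expected)
```

A p-value above the significance level here means the observed split is consistent with the stated 45/55 proportions.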
31. 12 Chi Square test for Independence: Hi, continuing with chi-square. Now for the second type of chi-square, let's suppose I am given this question. Two drugs are administered to patients to treat the same disease. We need to test whether the drugs are equally effective or not. So I'm going to create this variable treat, in which I will have this matrix, as the chi-square test for independence compares two variables in a contingency table. So we need to create a contingency table. For this matrix, we have these values, 42, 15, 14, and 18, in the form of a vector, and it will have two rows and two columns. If I run it and show you this matrix, this is a contingency table, but let's also give it column names and row names. So the column names for treat will be cured and not cured, and the row names will be drug alpha and drug beta. If I show you this treat, we have this contingency table. Now, if you read this statement, these two drugs, drug alpha and drug beta, are administered to patients to treat the same disease. So we got 42 cured for drug alpha and 15 cured for drug beta. We got 14 not cured for drug alpha and 15 not cured for beta. Let's make it 18 instead, so that we will not have the same number twice. Now, to test if there is a difference between drug alpha and drug beta effectiveness, I'm going to develop the hypotheses. Our null hypothesis is that there is no difference between drug alpha and drug beta effectiveness, and the alternative hypothesis would be that there is a difference between drug alpha and drug beta effectiveness. If I run chisq.test on this treat contingency table, you can see I got a very small p-value. So I can say right here: as the p-value is very low, reject the null hypothesis. Hence, there is a difference between drug alpha and drug beta effectiveness. Now, if you want to see the different components of the chi-square result, you can assign this chi-square test to a variable and run it.
And now, mentioning this variable and then putting the dollar symbol, it will give you the different components. For example, observed values: you can see we got the original data in observed. If you want the expected values and run it, these are the counts which your drugs would need to have in order to support the null hypothesis. You can also see the residuals, the leftovers. Here is a second question to answer. Let's say 250 adults were asked whether the government should ban fast food chains to discourage junk food consumption. The following results were obtained: in favor, against, and no opinion. Let's convert it into a contingency table. And then we have to check this research question: do men and women have the same distribution of opinions in this area? So I'm going to create this variable ban, which will be equal to the matrix of this data. So 17.823636, putting it inside a vector, and it will have two rows and three columns. Running it. So the column names are in favor, against, and no opinion, and the row names will be men and women. If I show you this ban, this is our contingency table. Now let's develop our hypotheses. Our null hypothesis would be that men and women have the same distribution of opinions. The alternative would be that men and women do not have the same distribution of opinions. Now, to test if there is a difference between the men's and women's distributions, I'm going to use, of course, the chi-square test and pass this ban. Running it. So the p-value is lower than the significance level of 5%, so we reject H naught. Hence, we can say men and women do not have the same distribution of opinions in this area. So that's how you perform the chi-square test for independence in R.
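The drug example can be sketched from the four counts given in the movie. The dimension names and the variable name result are choices made here. For a two-by-two table, chisq.test applies Yates' continuity correction by default; the sketch keeps that default.

```r
# Contingency table from the movie: matrix() fills column by column
treat <- matrix(c(42, 15, 14, 18), nrow = 2, ncol = 2)
colnames(treat) <- c("cured", "not cured")
rownames(treat) <- c("drug alpha", "drug beta")
print(treat)

# Test of independence between drug and outcome
result <- chisq.test(treat)
print(result)

# Components of the result, accessed with the dollar notation
print(result$observed)   # the original counts
print(result$expected)   # counts expected if drug and outcome were independent
print(result$residuals)  # Pearson residuals: (observed - expected) / sqrt(expected)
```

The expected counts keep the same row and column totals as the observed table; the residuals show which cells pull the statistic away from independence.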
32. 13 Correlation Test: The next statistical test is correlation. Correlations are used to measure the strength of the relationship between two variables, and correlation values are always between minus one and plus one. Positive correlations indicate that when A increases, B also increases. So if you have a correlation of plus one, it means you have a perfect increasing linear relationship. If you have positive 0.70, you have a strong increasing linear relationship. If you have 0.50, then some increasing linear relationship, and if it is 0, no detectable relationship. So it's like creating your own scale for positively correlated values. Now, in the case of negatively correlated values, when A increases, B decreases. So if you have 0 correlation, no detectable relationship; minus 0.50, some decreasing linear relationship; minus 0.70, a strong decreasing linear relationship; minus one, a perfect decreasing linear relationship. This correlation measures the extent of a linear relationship between two variables, and this first type is the Pearson correlation. Now, there is one more type, which is called the Spearman correlation, which measures the linearity of the ranks. The Spearman correlation uses ranks to measure possibly nonlinear relationships. Both of these correlations look at two variables. Correlation does not tell us how to predict one variable from the other, or the size of the effect. To compute the Spearman correlation, the variables are individually ranked, replacing the smallest value with rank 1, the second smallest value with rank 2, etc. And after ranking, we simply compute the Pearson correlation of the ranks. Now the question is when to use which. The Pearson correlation coefficient is the most widely used.
And it measures the strength of the linear relationship between normally distributed variables, whereas the Spearman correlation is used when the data is not normally distributed. Pearson is a parametric test, whereas Spearman is a non-parametric test that assesses how well the relationship between two variables can be described using a monotonic function, which is flexible enough for your data. So if your data is on a ratio scale or interval scale, use Pearson. If your data is on an ordinal scale, use Spearman; nominal data has no order to rank, so Spearman needs at least ordinal data. Let's see correlation in R. So I'm in my directory and we have this file, student before after teaching intervention dot csv. Let's work with this data file. So STD data is equal to read.csv, passing that filename and the additional parameter fileEncoding, UTF-8-BOM. Running it. But before that, let's set the working directory to the source file location. Now, plotting this data first, which is always a good practice. So before, after. This is the data. You can also plot it using a formula, and let's say I give it the after column first and then the before column, because I want to plot after against before. So no comma here. I can also give it pch is equal to 16. Now, to get the correlation, you can call the cor function, to which I can simply pass these two columns. If I don't mention any argument here, like method, it will be the default, the Pearson correlation. If I run it, you can see we got a correlation of 0.71. And you can also give it one additional argument, method is equal to pearson, and you will have the same result. If you want to go for the Spearman correlation, you can mention the spearman method here, but typically when you have data on an ordinal scale, like first, second, third. As I mentioned in the slides, the Spearman correlation is equivalent to the Pearson correlation if you just rank the data first. So if I apply the rank function on this column and also on this column, then we will have the same value as the Spearman correlation. This was example one.
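The claim that the Spearman correlation is the Pearson correlation of the ranks can be checked with a short sketch. The before and after vectors are simulated stand-ins (seed and distribution are assumptions), not the teaching intervention file.

```r
# Hypothetical before/after marks for 20 students
set.seed(123)
before <- round(rnorm(20, mean = 18, sd = 3))
after <- before + round(rnorm(20, mean = 2, sd = 2))

# Pearson is the default method of cor()
pearson_default <- cor(before, after)
pearson_named <- cor(before, after, method = "pearson")

# Spearman directly, and as the Pearson correlation of the ranks
spearman <- cor(before, after, method = "spearman")
pearson_of_ranks <- cor(rank(before), rank(after))

print(c(pearson = pearson_default,
        spearman = spearman,
        pearson_of_ranks = pearson_of_ranks))
```

The last two values agree because cor with method = "spearman" ranks each variable, averaging tied ranks just as rank does by default, before computing the Pearson coefficient.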
Now, to show you example two, we have this pulse before and after exercise data. So I read it into pulse data. Now, to find the correlation: cor, pulse data, the pulse 1 column, and pulse data, the pulse 2 column. Running it. So it has a positive, somewhat increasing linear correlation. If I ask for the correlation between the height and weight columns of the pulse data, it still has some increasing correlation, in the mid-range. If you want to see the correlation for all the columns of this data, you can simply pass the pulse data to cor. So now it has returned the correlation between every pair of columns. The correlation between height and height will be, of course, one; weight and height, 0.57; and gender and ran will be this, minus 0.033. So just passing your data name to the cor function will calculate the default Pearson correlation between every pair of columns. Now, if you want to get p-values for your correlations, you can install this package, Hmisc. Then you can simply call the function from this package, rcorr, and give it your values as a matrix. So as.matrix, pulse data. It will give you the p-values for the correlations in your data. Now, if you want to run a correlation test, for example, I can require the datasets package, in which we have the cars data. If I show you the head of this data, we have two columns, speed and distance. To run the correlation test on this one, first let's calculate the correlation. So cars dollar speed, and as the second argument, cars dollar dist. We have a very high correlation, 0.80. Now, to run the correlation test, cor.test, I can simply pass these two columns. If I run it, you can see that the p-value is very, very small, which means there is a significant correlation between the speed and distance columns. Now detaching the datasets package, and also Hmisc. So that's how you run a correlation test in R. Now you have seen correlation in R. But as a final note, let me clarify one thing. Correlation is simply a relationship. Correlation does not mean causation.
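The cars part of the walkthrough above can be sketched with base R alone. The built-in `cars` data has the columns `speed` and `dist`. The Hmisc step is left out so the block runs without extra packages, and the names `r`, `m` and `res` are illustrative choices, not taken from the lecture.

```r
# Built-in data set: 50 observations of speed and stopping distance (dist)
print(head(cars))

# Pearson correlation between the two columns (the default method)
r <- cor(cars$speed, cars$dist)
print(r)

# Passing the whole data frame returns the correlation for every pair of columns
m <- cor(cars)
print(m)

# cor.test adds a significance test of the null hypothesis that the
# true correlation is zero
res <- cor.test(cars$speed, cars$dist)
print(res$estimate)  # same coefficient as cor()
print(res$p.value)   # a very small p-value rejects "no correlation"
```

A small p-value here only says that a zero correlation is unlikely given the data. It says nothing about which variable, if either, drives the other.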
Causation explicitly applies to cases where action A causes outcome B. So it's possible that action A relates to action B, but one event does not necessarily cause the other event to happen. Even though A and B are correlated, it is possible that they are both caused by something else, action C. So don't confuse yourself. This guy is reading on the internet that cell phones cause cancer. The other guy comes and asks what the World Health Organization is thinking. Now this keyboard activist shows him the results and says, look at this graph, bro: as the number of cell phones increases, so does the cancer incidence. So this wise mate says you are wrong, but this guy still thinks: until further findings come, I'm going to assume cancer causes cell phones. Wait, what? So don't confuse correlation with causation. There are many examples available. For example, I have picked this plot from the mentioned source. The red line shows US spending on science, space, and technology. The black line shows suicides by hanging, strangulation, and suffocation. Even though the correlation is positive, 0.99, or 99%, it does not mean that US spending on science, space, and technology causes suicides. Right? Here is another example. The number of people who drowned by falling into a pool correlates with the number of films Nicolas Cage appeared in. Does it make any sense to you, even though the correlation is 66%, that somehow people are falling into a pool and drowning just because Nicolas Cage is making movies or appearing in, let's say, Ghost Rider? It doesn't make any sense, right? Here is another example. Per capita cheese consumption correlates 94.7% with the number of people who died by becoming tangled in their bed sheets. Does this make any sense to you? Of course not. So when you pick two variables to check correlations, don't confuse it with causation.
Correlation is very helpful if you are measuring the same variable at different points in time, as in before and after cases, or comparing the effectiveness of product one and product two, or in cases where there is a real-world relationship between the variables. So I will end this lecture by saying: when you use the correlation function, choose your variables wisely.
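The point about a hidden third factor (action C) can be illustrated with simulated data. This is a rough sketch under stated assumptions: both series are invented, they share nothing except an upward drift over time, and the names `spending` and `incidents` are placeholders rather than the real figures from the plots discussed in the lecture.

```r
set.seed(1)
year <- 1:20

# Two made-up series that both drift upward with time but never
# influence each other: the noise terms are drawn independently
spending  <- 10  + 0.5 * year + rnorm(20, sd = 0.5)
incidents <- 100 + 3   * year + rnorm(20, sd = 3)

# The raw correlation is high purely because of the shared time trend
r_raw <- cor(spending, incidents)
print(r_raw)

# Remove the trend from each series and correlate what is left over
r_detrended <- cor(resid(lm(spending ~ year)), resid(lm(incidents ~ year)))
print(r_detrended)
```

Once the common driver (here, time) is taken out, the apparent relationship between the two series largely disappears, which is one way a strong correlation can arise without either variable causing the other.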