Transcripts
1. Preview: Welcome to the Zero to R Hero course, where you will learn R programming and how to harness powerful statistical analysis tools for data science. My name is David, and I am a professor and climate scientist with over 10 years of experience using R programming as both a researcher and a teacher. I am passionate about teaching, and I strive to make my course content accessible to everyone, no matter what your experience level. I have designed this course to revolve around the use of real-life data sets, step-by-step instructions, and practical hands-on activities so that you can learn new skills without getting bogged down in complex theory. At the end of this course, you will have developed a foundation of basic ability to program in R, be capable of performing widely used statistical analyses, and be familiar with how to generate meaningful graphics of results. The ideal student for this course is interested in expanding their data analysis capabilities and learns best by working with real examples. There are no requirements to enroll other than a willingness to learn R programming. Feel free to take a look at the course description, and I look forward to seeing you inside.
2. Intro to R: I'd like to begin with a brief introduction to R programming and talk about why it's a really, really good thing to use. First of all, R is absolutely free. The program itself is free to install, and all of the additional functionality that we'll be using is free to install, so you can put this program on your own machine or on your office computer and never have to pay for it. R is also a programming language, so it's highly customizable. You can perform very unique, very customized analysis tasks because you're writing out everything in code. It's also easy to rerun those complex analyses because it's code. Everything is saved in a script file, so you can open that up years later and perform the analysis in exactly the same way. It's also really easy to automate in R, so this is a huge thing for saving you time. You know that it takes a long time to do analyses by hand, especially if you're doing something really repetitive on a large data set. But because this is a programming language, we can iterate through that data set across those analysis tasks and let the computer do the heavy lifting for us. R is also great because you can produce really high-quality and really professional-looking graphics, so you can really elevate your findings and your results. There are also many diverse packages or extensions available in R. There's a lot of functionality built into the base program of R, but there are also many, many hundreds of packages out there that have been developed to add additional functionality to the base program. We'll be using many of those. There are also loads of community resources. There are so many people using R in the world that the chances are that no matter what problem you come across in your coding, someone else has probably already had the same problem and has posted a solution online. So Google is your friend. There are many blogs, resources, and help pages.
There are also tutorials out there that you can tap into to find a solution to your problem. R is also hugely widely used in academic and commercial research. More and more people are using R every day, and it is very much becoming the standard for statistical analysis and data science. And finally, R is open source, meaning that functionality is continually added to R, so R is always evolving to provide you with additional functionality to perform new analysis tasks. Having said all of that, there are also reasons why R can be a huge pain, especially when you're starting out and learning for the first time. R is command-line driven, so there is no pointing and clicking your way to nirvana. There are no menu systems, and everything has to be coded by hand. So along with that comes a very steep learning curve. It is just like learning a new language: it's challenging and it will take some time. But I am here to help you with this. So let's compare R's learning curve with the learning curve of other programs. Obviously, the longer you spend learning something, the longer you practice, the better you will become. But some things are harder than others. So if we look at R's learning curve there, you will see that it is initially very steep, and although we may lose some of you along the way, I am here to help you get over that hump and reach a plateau where you become a confident, expert R user. To illustrate just how widely R is used around the world, I want to ask you this question: what's a pirate's favorite programming language? Well, it's R, of course. Arr, mateys. That's just a little joke. But do please join me in the next video, where we're going to look at examples of R code and look at how to read the code, look at its structure, and determine what it's actually doing. So I will catch you in the next one.
3. How to read code: In this video, I would like to briefly show you an example piece of R code and talk about how we recognize what the code is doing, how to read the code, and how we think about its structure. So how do we read that piece of code? What should we be looking for? Should we be looking for letters or words? Should we be looking at the numbers? Should we be looking at the symbols that are used? Well, we actually shouldn't be looking at any of those things first of all. No, first we should be looking for the comments. Comments have been written by the person who wrote the code in the first place, and they explain what the code is doing. So it's vital that we read those. Next, we're going to look at the white space: the space, the empty lines, the indentation that's around the code. That tells us a lot about the structure of the code and what the code is doing. Then we're going to look at what is enclosing that white space. These could be words like function or end, or they could be symbols, including brackets, parentheses, and curly braces. Then we're going to look at symbols, words, and letters, and last of all the numbers that are being used. So as we look back at this example piece of code, the first thing we're looking for is comments. This comment tells me that this piece of code is a binomial simulation. If I read the description underneath there, it tells me that this code is calculating the proportion of heads and tails that are achieved when tossing a coin. You'll notice that this block of comments is enclosed by hashtags or pound signs. In R, the hashtag or pound sign tells the computer: do not read this. This is human speak. This is English. You are not to run this. So this is where we look for explanations of what the code actually does. We can then start to look at the white space here, and on line 23 we see that there is some white space. This is indicating that, okay, we've had some comments.
Now we're going to start doing some code. We then have some more white space down on line 32. So that indicates that from 24 to 31, all of those lines of code are doing one thing. Then on lines 33 to 44, we have another chunk of code that is separated by some white space. So that says another section of the code is doing something else. Another thing we have going on is indentation. On lines 34 to 43, every single line of code there is indented or tabbed inward. That tells me visually that every line there is within, or part of, the outer for loop that is going on here. Then we also start to notice that there are some curly braces and some parentheses. The curly brace in this instance is indicating where a for loop is happening. Everything within those curly braces, from line 33 to line 44, is code that is running and iterating within a loop. We also have parentheses, and parentheses are generally indicating where a function is happening. So here we have c parentheses. This is the concatenate function, which joins together or groups elements. We also have symbols and operators in this code. If we go to line 26, I've circled something that looks like an arrow. It's a chevron followed by a dash or a hyphen. That is the assignment operator in R. So here on line 26, we're assigning the number ten to the variable flips. Down here, within the for loop, we have an if statement, which assesses if i is greater than two. So this is a logical operator. It is assessing whether i is greater than two, and it will return true or false. And then finally, we have an equals sign that I've drawn a circle around down here. This is coming within a function, and it's assigning something to an argument called main. Within the function here, it is assigning proportion of heads, a text string. So programs are simply instructions. They are sequential flows of instructions that the computer will read. So going from lines of code to nice output, a nice figure, a nice graph is not magic.
It's simply the computer reading those lines of code. So we have to think carefully about the order that we present those in and whether the computer will make sense of them or not. According to Edsger Dijkstra, a big software engineer and a founding father of computing, all programs can be structured in four possible ways: we can have a sequence, we can have branches, we can have loops, and we can have modules. So let's look at each one of these different kinds of structures briefly, because each one of these forms the basis for programming logic. We have strict sequential flow, where we simply have one piece of code, another piece, and another piece, and it performs an analysis. So strict sequential flow: step one, two, three, and ta-da. We then have something called a branch, which is a conditional statement. So this would be like an if statement. We have something that's done in step one. It is then tested: some logical statement is tested, and it returns true or false, and that determines whether we follow path one or whether we follow path two. So you can see the code is branching into two or more paths. Then we have loops. These are continuously repeating until a condition is met. So you could think of this as looping through a data table. Take a data table with, let's say, 100 rows. You start on row one. You perform some analysis within the loop, then you go to row two, three, four, five. So you iterate through 100 rows until you reach the last one, and then the looping stops. Modules are simply putting many parts of an analysis or steps of code within a function. So here we've got several steps, parts of code that we're following sequentially. And instead of writing each of those lines out every single time we want to have the computer read them, we can put four of those boxes, four of those steps of code, within this module or function, and then we can call all of that functionality any time we want it without writing out all of the individual steps.
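The four structures just described (sequence, branch, loop, module) can be sketched in a few lines of R. This is only an illustration, not code from the course: the numbers and the names `values`, `size_label`, `running_total`, and `describe` are made up for the example.

```r
# Illustrative sketch only: all data and names here are hypothetical.

# Sequence: steps run one after another, top to bottom.
values <- c(4, 8, 15, 16, 23, 42)
total  <- sum(values)
avg    <- total / length(values)

# Branch: a logical test decides which path is followed.
if (avg > 10) {
  size_label <- "large"
} else {
  size_label <- "small"
}

# Loop: repeat the same step for each element until the last one is reached.
running_total <- 0
for (i in seq_along(values)) {
  running_total <- running_total + values[i]
}

# Module: wrap several steps in a function and call it whenever needed.
describe <- function(x) {
  c(mean = mean(x), min = min(x), max = max(x))
}
result <- describe(values)

print(size_label)
print(running_total)
print(result)
```

Calling `describe()` again on a different vector reuses all three steps without writing them out a second time, which is the point of a module.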
So these are things, each one of these aspects of code, that we're going to look at as we go through this course. But beyond the structure, what else are we looking at? We're looking at variables. These are going to have different types or different structures to them. There are vectors, which are groups of the same kind of element. There are matrices, which are essentially tables of columns and rows. And then within R there is a special structure called a data frame, which is a table of data, but it can contain different variables with different elements. It could contain numeric data, it could contain text data; it could contain many different kinds of variables in the same data table. We also have operations. These are like verbs in the language; these are action terms. They can be mathematical (addition, subtraction, multiplication, division), or they could be logical operators like greater than, less than, or equal to. And then we have the way that we're interacting, the dialogue between ourselves and the machine. So as we look again at this piece of code, we can see branches within it. We have an if statement down here that assesses whether i is greater than two. If i is greater than two, we will run the code that is within that if statement. We also have a loop here that is looping for i in one colon trials. So this is saying that for one through the value of trials, which we assigned farther up on line 27 as 100, so for one through 100, i is going to equal one in the first iteration, i will equal two in the second iteration, then three, four, et cetera, until it reaches 100, and then it will terminate. We also have modules within this code. We have a sum function, and within that we have a sample function, and within that we have a concatenate function. So here we are encapsulating many steps within one function. And then we have data and we have operators. We are assigning numerical data to coin, to heads, to flips, to trials, to speed.
We are also performing logical operations with the greater than and also the equals sign. There you have it. This is how we look at code, how we begin to understand what it's doing, how we read code. So now, let's get on without further ado and practice actually writing code.
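The on-screen code is not reproduced in this transcript, so the following is only a rough sketch of what a commented binomial coin-toss simulation with those features might look like. It is an assumption, not the instructor's script: the variable names echo the ones mentioned (`coin`, `flips`, `trials`), but `prop_heads`, `running_mean`, the seed, and the choice of `hist()` for the figure are invented for illustration.

```r
#####################################################
# Binomial simulation (illustrative sketch only)
# Proportion of heads achieved when tossing a coin
#####################################################

set.seed(1)              # assumption: fixes the random draws so reruns match
coin   <- c(0, 1)        # c() concatenates elements; here 1 stands for heads
flips  <- 10             # <- is the assignment operator
trials <- 100
prop_heads <- numeric(trials)   # hypothetical storage for each trial's result

for (i in 1:trials) {
  # Nested modules: c() feeds sample(), which feeds sum().
  heads <- sum(sample(coin, flips, replace = TRUE))
  prop_heads[i] <- heads / flips

  if (i > 2) {
    # Branch: this only runs from the third iteration onward.
    running_mean <- mean(prop_heads[1:i])
  }
}

# Inside a function call, = assigns a value to a named argument such as main.
hist(prop_heads, main = "Proportion of heads")
```

Reading it in the order suggested above: the comment block first, then the blank lines that separate setup from the loop, then the indentation and curly braces that mark the body of the `for` loop, and only then the individual symbols and numbers.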
4. Installing R and RStudio: So let's get started. The first thing that we have to do is install two programs. The first program is R itself. And then we have to install RStudio, which is more of a graphical user interface that allows us to more easily interact with R. So open your web browser, go on the internet, and Google or search for R download. The first thing that should come up is The R Project for Statistical Computing. Let's click on that, and now we're at the R website. On the left-hand side we see a bunch of menus. We have a Download section with a CRAN link, so click on CRAN. This takes us to a list of mirrors, or servers, that are essentially hosting R around the world. So wherever you are, scroll down to a server or mirror that is close to your location. I'm going to scroll all the way down to USA, and I'm just going to pick University of California, Berkeley. But you should pick something that is close to where you are actually located. When I click that, I then get Download and Install R. I have three options: Download R for Linux, Download R for Mac, and Download R for Windows. I'm on a Mac, so I'm going to click Download R for Mac. This takes me to an additional page where I get a package installer for version 3.6.1 of R. R is continually updated, so just come to this page and take the most recent version. That is going to download. Once it's downloaded, you can install it. The next thing that we're going to search for is RStudio download. The first thing that comes up is Download RStudio from rstudio.com. I'm going to click on that. This takes us to the RStudio website, and we can download RStudio Desktop for free. Clicking Download scrolls down the page a bit, and we can choose our operating system. Again, I'm on a Mac, so I'm going to click the macOS option, and that's also going to download, and then you can install it.
You can then open up R from your Applications list if you're on a Mac, or from Program Files if you're on Windows. When you open R you get the R console window. This is a very bare-bones, limited look at R. It tells you just a little bit about the program itself, and it gives you a flashing cursor, a command line at which you can begin writing code. So, for example, I can write two plus two and check that. Yeah, it does still equal four. That's good to know. So I could enter all of my code into the console here, but it'd be a little bit cumbersome if I want to create multiple lines of code and edit them and change them and go back, open them, save them, that kind of thing. So what we're going to do is not use the R console itself. We're going to instead use RStudio, which communicates with it. So when we open up RStudio after you've got it installed, you'll see there are multiple panes. The first thing we're going to do is go to File in RStudio, and we're going to do New File, R Script. This will open a new window. So we've got four panes, or four windows, here, and this new window is an untitled, empty script document. Here I can begin to write lines of code and then I can run them in the console that's down here in the bottom left window. This is just a smaller version of the R console itself, so I can write script in the top left window and run it in the bottom left window. As I do that, I will begin to see that the variables I create or data sets that I'm working with are populated in the top right window. That's called the environment. It shows me every variable that I'm working with in my current environment, and it will give a summary of those variables. And then in the bottom right pane, we have a few interesting things. We have a tab for plots. This is where all of the figures that we create will appear, and we can export them from here as different file types. We also have a Help tab, which is really, really handy.
It gives us explanations of functions, so we can very easily get help. So, for example, I have the plot function already open here. It tells me what the plot function is. It gives me a description, tells me how to use it, and gives me the arguments that it requires. And also, if I scroll further down to the bottom, it gives me examples of how to use that function as well as links to related functions. And I can look up any function here. One of the first ones we will be using is read.table, so I can type in read.table here and hit Return, and it gives me the documentation and help for it. It tells me what it does, shows me usage of it, and tells me what the arguments are. So it's really, really handy. It's very easy to get help in the RStudio setting here. We could be looking at help on a function in the bottom right window while we're writing or using the function in the top left window and executing it in the bottom left window. So it's a very nice environment to get our R programming done and to get help immediately. Also, with this environment window up here, we can see what we're creating immediately and get an idea of whether it looks right or not. So we're going to use RStudio for all of our programming in this class, and I encourage you to keep using it into the future. It's a very cool environment to work in, and it makes R programming that much easier.
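As a small sketch of the console check and the help lookup described in this video, the lines below could be typed into a script and run. The name `answer` and the use of `formals()` to list argument names are additions for illustration; the transcript itself only uses the Help tab's search bar.

```r
# Arithmetic typed at the console is evaluated straight away.
answer <- 2 + 2
print(answer)

# In RStudio, either of these opens the documentation in the Help tab:
#   help("read.table")
#   ?read.table
# They are left as comments here so the sketch also runs outside RStudio.

# A quick way to see a function's argument names without leaving the console
# (an extra shown for illustration, not something used in the video):
read_table_args <- names(formals(read.table))
print(head(read_table_args))
```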
5. Intro to basic analyses: Hello and welcome to the course. I'm really super excited that you're here, and I'm really excited to start our journey into R together. In this class, we're going to learn so many new things, and we're going to continue to build on those as we work with real data sets through the class. But here's what we're going to learn specifically in this first section of the class. First of all, we're going to learn how to load data into our R environment. Then we're going to learn how to prep or pre-process the data, to check it for possible errors, and how to fix those errors that we may find. We're also going to learn how to create basic plots, including histograms, box plots, and scatter plots, as well as a few other kinds of basic plots. We're also going to be calculating descriptive statistics. These include things like mean and median; they are ways of describing a data set and the population of interest. We're also going to be working with several different kinds of variables, including vectors and something called factors. We're going to be creating data frames, which are a really important data object within the R programming environment. We're also going to create user-defined functions, where not only do we use inbuilt functions within R, we're actually going to create some of our own. We're also going to learn how to iterate across vectors and data frames using looping structures in order to automate our code. And then lastly, we're going to perform inferential statistics: things like t-tests, ANOVAs, correlation, and regression. The main difference between inferential and descriptive statistics is that inferential statistics allow us to ask and answer more in-depth questions about our data set and to get insights out of it. The data that we'll be working with in this first section of the class is called Time in Life.
It's a data set I actually helped to create many years ago, and it's a data set where we asked individuals from the public, how old do you think you are? Members of the public were given a piece of paper with this crosshair diagram on it. You'll see that there is a horizontal black line and a vertical black line down the middle. They were asked to look at this crosshair and imagine that the horizontal line is time and the vertical line intersecting with it is the present. We then asked people to think about where they are in their life and to draw vertical lines on this diagram. We asked them to draw a line to the left of the central vertical line, indicating when they were born, and to draw a line to the right of the central vertical line, indicating when or where they think they will die. So we're really asking people to judge where they are right now compared with their lifespans. In other words, how old do they really think they are? Are they close to death, or is it much further into the future? So, for example, if I was to fill in that survey, I would put my birth fairly close to the present because I'm not that old. And my death, a vertical line where I think I'm going to die, I would put way, way off into the future, because I think I'm going to live forever. We would then measure these lines, and we would determine how far someone thinks they are through their life. We could then compare this with actuarial tables that statisticians use, particularly those involved with life insurance and selling those sorts of policies. Those tables tell statisticians how long someone is likely to live based on when they were born, so we could compare how old someone statistically is to how old they think they are. When we did that, we could determine if their idea of how old they are was negative compared with how old they actually are. And if that was the case, that meant that they felt younger than they actually are.
It means that they feel more youthful. They think they're farther away from death than they actually are. And some people would get a positive number. Some people would be overestimating their time in life, and would think that they are older than they actually are. And then we also asked these people for loads of other variables, things like whether they were married, whether they are employed, and what their education level is, and we would use those other variables to predict how old people think they are. So let's get into this data set. Let's start looking at it, and I know you're going to find it fascinating to gather these insights, to analyze this data set, and to create some really interesting figures and visualizations of what we learn about it. So let's get started. I will see you in the next video.
6. Setting up your programming environment: Before we get started with writing any actual code in R, I think it's a really good idea to set up a folder system for storing our code, our data, and our output. So I have created a Section 1 folder on my computer. You can create this wherever you want, in a place that works for you. If we look in my Section 1 folder, you'll then see that I have some folders for data, output, and something called src, which stands for source. That's where I'm going to save our source code, the code that we're actually writing. If I click on the data folder, you'll see I've already downloaded the time in life data .txt file. That is our data file that we're going to be working on and reading in during the first section of the course. You can also see the output folder, where we're going to store plots and general output, and also the source folder, where we're going to save our source code. So, having set that up, we're going to open RStudio. RStudio is a great environment for working in R, and you'll notice that we've got an untitled script file in the top left-hand pane. If you don't have that, you can open a new script file by going to File, New File, R Script. This is where we're going to begin writing code. The first thing that we should do is name the script file and create a little bit of a comment where we write what that code is about and what it's for. I have already created some script to do this. I'm just going to copy and paste it, and you can copy it into your untitled document. All this is, again, is hashtags or pound signs indicating to the computer that it is not to read that; those are human comments. And I've written that this is an introduction to statistical analysis in R. I'm saying that we're going to do data prep and basic analysis, and that we're using the time in life data originally prepared by me, David Keellings. Now we've got something written in the script file.
We should probably save it. To do that, we can push the save button or we can go to File, Save As, and we're going to navigate to wherever you created your Section 1 folder, go into the source folder, and then we're going to name this file and save it. I'm going to simply call this statistical_analysis_code, giving it a basic description, something that I'll understand. And I'm going to save it into the source folder. So now it's no longer untitled. We've given it a name, and we can continue to click the save button as we continue to add more and more lines of code to this. So catch me in the next video, where we're going to start writing code to actually read in the data set and start checking it for errors. See you then.
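The same folder layout can also be created from R instead of by hand. The sketch below assumes a throwaway location under `tempdir()` and a folder name `section1`; both are stand-ins for wherever you actually keep your course files, and the header text only paraphrases the comment block described above.

```r
# Hypothetical location: tempdir() stands in for your own Section 1 folder.
project_dir <- file.path(tempdir(), "section1")

# One sub-folder each for the data file, the results, and the source code.
for (sub in c("data", "output", "src")) {
  dir.create(file.path(project_dir, sub), recursive = TRUE, showWarnings = FALSE)
}

# A script saved into src/ would open with a comment header along these lines.
header_lines <- c(
  "# Introduction to statistical analysis in R",
  "# Data prep and basic analysis"
)
script_path <- file.path(project_dir, "src", "statistical_analysis_code.R")
writeLines(header_lines, script_path)

# List what was created.
print(list.files(project_dir, recursive = TRUE, include.dirs = TRUE))
```

Keeping data, output, and code in separate folders means a script can always find its input and write its results without overwriting either.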
7. Reading in a dataset and changing column names: Let's get started with writing our first lines of R code. The first thing I'm actually going to do is write a comment to myself that says this is where I'm loading and preparing the required data. Again, writing comments is super important, not just to help you when you look back at code you've written, but also because when someone else looks at your code, it helps them make sense of what you were actually writing. The function we're going to use to read in the data is called read.table. This is a function, so it has a name, read.table, and it's followed by parentheses. Within those parentheses is where we specify arguments, separated by commas. Throughout this class we're going to use many, many functions, so I encourage you, each time we come across a new function, to pause the video and look up the help on that function. I've already loaded the help down here in the bottom right-hand pane for read.table. And remember, you can search for any function name using the search bar here. For our purposes, read.table needs three arguments. The first argument is the path and file name to wherever your data is on your machine. But to make things a little bit easier, I'm going to use the file.choose function. When I run this, it will actually open a Finder window or Windows Explorer window, and that allows us to navigate to wherever the data is on our machines. The next argument is sep. sep is required because R needs to know what kind of separator is between the data in our data set. You may have heard of CSV, comma-separated values, but we're actually going to use a .txt data set here, so those are tab-separated values. Between each column in our data set there is a tab, and R needs to know what that is so it knows where there is separation between columns. To specify that, we're going to put double quotes, and inside them I'm going to put backslash t. That's how R knows a tab character. The last argument is header.
We have to specify whether header equals FALSE or header equals TRUE. If header equals TRUE, then R is going to assume that the first row of data in our data set is actually the column headers or column names. Our data set doesn't have names, and so we're actually going to say header equals FALSE so that R knows that the first row of data is just the first row of data and not column names. Now, we could run this as it is, but it's much more useful if we assign it to a variable. So I'm going to assign it to a variable called life data using the assignment operator, the chevron and the dash. Let's see what happens when we run this line of code using Command Return or Control Return, depending on what machine you're on. It immediately opens up a Finder window. This is where I'm going to navigate to my Section 1 folder and the data folder within that. And this is where I find my time in life data .txt that I have already downloaded from the class website. So I'm going to click on that and go ahead and open it. Immediately, you see that in the environment window, top right, we have life data as a data frame. It has 210 observations of 13 variables. So this is essentially a data table with 13 columns and 210 rows. We can see it in more detail if we click the white and blue arrow button. It shows us that we have columns V1 through V13. Again, we specified to R that there were no headers in this data set, so R has gone ahead and made some. Now, they're not the most descriptive, so we'll have to do something about that. But it also lets us see a sample of the data that's in each one of these columns. It tells us whether it's an integer, a factor, or a numeric, so that tells us the type of data that's in each column. And then it gives us the first few bits of data that are in each one of the columns. Another way to view data in R is to simply type life data, the variable that we created.
Run that, and you'll see that R prints the data set to the console, bottom left, so we can see all 13 columns, but only the first 76 rows. R does this, cutting off the data set, so it doesn't fill up the console with particularly large volumes of data. Another cool way that we can look at the data set is to use the head function. head will return by default the first six rows of a data set. So we specify life data. But let's say we don't want just six rows. Let's say we want 10 rows. We can, in the second argument, specify how many rows we want. When we run that, we get just the first 10 rows of life data. Another cool function to know is tail. This does exactly the same thing as head, but as you may have guessed, it actually looks at the tail or the end of the data set, and you can specify how many rows you want to see from the last row back in the data set. The next thing that we want to do is fix those column names. V1 through V13 is not very descriptive, and it's not going to help us out when we're trying to analyze this data set. So I have already prepared a list of the variable names that are in this data set. I'm just going to copy and paste them in here, and I'll return this down a little bit so you can see them all. These are all of the variable names. You'll notice that I've put all of them in double quotations. This tells R that they're text strings; they're human language, not for it to read. And I've also got commas between them, because the first thing that we want to do with these is concatenate them together using the c function, and this is going to create a vector of these like elements, all of these text strings together. And then what we're going to do is assign that to the column names of the data set. So I can use the function colnames. I can give it life data, and then I can assign my concatenated text strings to that.
Let's see what happens when we just highlight colnames life data and run that. It gives us the V1 through V13, so we can actually assign new names on top of those. So we run this entire piece of code, and it doesn't matter that we returned at the end of these lines: because we have a comma, R knows to keep looking on the next line of code until it reaches the closing parenthesis, so we can run this entire thing together. And what we see immediately is that in life data, the V1 through V13 column names have all been replaced with more descriptive text string names that we recognize as humans. So join me in the next video, where we're going to explore this data set further. We're going to check it for errors that it may have, and we're going to fix those errors. So I'll catch you in the next one.
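The steps in this video can be sketched end to end with a tiny stand-in file. Everything specific here is an assumption for illustration: the three made-up rows, the temporary file, the spelling `life_data` for the variable the video calls life data, and the three column names (the real data set has 13 columns whose names are not listed in this transcript).

```r
# Stand-in for the course data file: a small tab-separated file, no header row.
# The values are invented for this sketch.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1\tM\t310", "2\tF\t295", "3\tM\t402"), tmp)

# In an interactive session, file.choose() could be used in place of tmp
# to pick the file through a Finder or Explorer window.
life_data <- read.table(tmp, sep = "\t", header = FALSE)

# With header = FALSE, R makes up default names: V1, V2, V3, ...
print(colnames(life_data))

print(head(life_data, 2))   # first two rows
print(tail(life_data, 1))   # last row

# Assign descriptive names; the call may span several lines because R keeps
# reading after each comma until it finds the closing parenthesis.
colnames(life_data) <- c("id",
                         "sex",
                         "age")
print(colnames(life_data))
```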
8. Checking for errors in a dataset: The next thing that we want to do is check the data set for possible errors. So I'm going to write a comment here, check the data set for errors, and one of the easiest ways that we can look for problems in the data set is to use the summary function. So we're going to say summary of life data and run that. Let me just maximize the console so we can see what happened when we ran it. summary of life data essentially performs a statistical summary on each column within the data set. So we get things like min, median, mean and max, and the first quartile and the third quartile, of the data in each column. And immediately I can see that there's an issue in the time column. I see that there is something here called NA's, and there are 10 of them. Now, in R speak, NA means not available, in other words a missing value. So that is an issue. We've got 10 of them in the time column. We also seem to have 10 in age, 10 in diff, and numbers of missing values elsewhere too. We also see that in the sex column we have M for male and F for female, 100 participants of each, but we also have a blank for 10. Because sex is a factor variable, it doesn't use NA here. Instead, when it sees a missing or nonexistent value, it creates a factor level of nothing. So we'll also have to fix that. Also, when I look at the age column, I can see that this was recorded in months, age of participant in months, and what I see is that the numbers are all fairly close together. But then I have a max value down here of 1430. If I divided that by 12 to find out that person's age in years, I would find that that person is over 119 years old. It is possible that we sampled a participant that old, but it's pretty unlikely. What is possibly more likely is that, because age is next to time in this data set, perhaps whoever recorded this data set has transposed 14:30, or 2:30 in the afternoon, and entered it as the age of the participant.
We also have 10 missing values in this column. I also start to notice that body temp, the participant's body temperature in degrees Fahrenheit, has a very low value of 90.8. Now, the average temperature of the human body is 98.6, so a temperature of 90.8 would mean that this person was in hypothermia and probably dead. I also have a max value of 118 F. This person would also be dead. It is not possible to have the human body at those temperatures, so we'll have to correct for those. Another thing I notice about body temperature is that we have 11 NAs, so we have even more missing values here than in some of the other columns. So we're going to go about fixing these. But first of all, let's look at another way to visualize these errors. A good way of doing it is to create a box plot. We can say boxplot of life data, and I specifically want to look at age, because we saw that there was something funny going on there with the 1430 months. To do that, because life data is a data frame, I can use the dollar sign and write age. That tells R that I'm only interested in looking at the age column within the life data data frame. When I run that, I get a basic box plot of all of the data, and I can see that the 1430 value is way higher than everything else; it's a very strong outlier. I can do the same thing with body temperature by specifying life data, dollar sign, body temp. I can run that and see that, yes, I have the 118 F body temperature, the person that is burning up, and I also have a couple of values that are far lower, far colder, than the human body should be. So I can visually see those outliers and see that there must be an error there. Another thing that I can do to visualize where all those missing values are is use the is.na function. When I supply the life data data frame to is.na, what it's going to do is look for anywhere that there is a missing value. And when I run that, what do we get? Well, let me maximize the console.
I can see it has gone through our data frame, and everywhere where there is not an NA it puts a FALSE value, and everywhere where there is an NA it puts TRUE. So as I'm looking down here, for example, I see that in row 24 there is a TRUE value. That means there is a missing value in row 24, column six. It hasn't printed the entire data set, because it's been cut off again to preserve space. But what I can do is look at the tail of the data set. I can give it life data, and I can specify that I want the last 35 rows. When I do that, I can see that, oh yeah, there's where all the other NA values are. From row 201 to 210 it's all missing values. So this gives me a really nice way of quickly seeing where the problems lie in this data set. Another thing I can do is pull out the rows of my data set where I have missing values, to see just those specific rows. Now, one way we can do that is to use the is.na function again, and we put is.na within another function called rowSums. So I'm putting rowSums, parentheses, and is.na within it. rowSums is essentially going to sum up the values in each row. is.na returns whether there is an NA there or not, and because TRUE counts as one and FALSE as zero, rowSums adds those up to give the number of NAs in each row. Then I really just want to see rows where there are more NAs than zero; I want to see rows where NAs exist. So I can put in a logical operator here, greater than, and I can specify zero. I want the rows where the number of NAs is greater than zero. Then I can use these nested functions with the logical operator to subset, or extract, from life data, and I do that by saying life data and then square brackets. Square brackets in R allow us to subset a data frame.
Within square brackets, we have to specify the row and the column, separated by a comma. So I'm going to put a comma here after the zero. So here I am counting the number of NAs that are in each row, and then I am looking for rows where there are more NAs than zero, where NAs exist. Then I am saying I want every single row in the life data set where that is true, by specifying the comma, and then I leave the column designation blank. I'm not putting anything after that comma. So this will pull out all the rows where there are NAs, and it will pull out every single column across those rows. Let's see what happens when we run this. When we run that, we see we get row 24, then we get rows 201 through 210. So this is pulling out anywhere we have a missing value. In row 24, most of the data has been collected, but body temperature is missing a recording; we're missing the record for body temperature for participant number 24. But for participants 201 through 210 we're missing absolutely everything. We're missing all of the data in rows 201 through 210. So please do join me in the next video, where we're going to fix these errors. We're going to get rid of the NAs, we're going to get rid of the outlier values, and we're going to replace them with proper values and fix this data set. So I will see you then.
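The checks walked through in this video can be sketched as below. The data frame is a small made-up stand-in: the name `life_data`, the column names `age`, `bodytemp` and `sex`, and every value are assumptions for illustration, not the course data.

```r
# Toy stand-in for the course data; names and values are assumptions.
life_data <- data.frame(
  age      = c(300, 310, 1430, 295, NA),
  bodytemp = c(98.6, NA, 97.9, 118.0, NA),
  sex      = factor(c("M", "F", "F", "M", ""))
)

summary(life_data)        # per-column summary; NA counts show for numeric columns
boxplot(life_data$age)    # the 1430 value appears as an isolated point

na_flags   <- is.na(life_data)   # logical matrix, TRUE where a value is missing
na_per_row <- rowSums(na_flags)  # TRUE counts as 1, so this is the NA count per row

# Rows with at least one NA; the blank after the comma keeps every column.
life_data[na_per_row > 0, ]
```

Note that the blank entry in the factor column is not counted by is.na(), which matches what the summary shows for the sex column.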
9. Fixing errors in a dataset: Having identified the errors in this data set, the next thing we want to do is try to fix those errors. One of the easiest ways that I can get rid of missing values in a data set is to use the na.omit function. Let's see what happens when I run na.omit. You can see that it prints the data set to the console. Remember from before that we had a missing value for body temperature in row 24. When we use na.omit, not only is it going to remove that NA value from that row, it's going to remove the entire row. And that may not be desirable, because it was the only missing value in row 24. There were other recorded data points, so we don't necessarily want to lose all of the data that was actually collected. Again, you can see here that it goes from row 23 to row 25, missing row number 24. So another way I can get around this is to use the rowSums and is.na functions that we used previously, so I'm just going to copy them down to here. What we see here again is that we're counting the number of NAs in each row, and then we can pull out just those rows that have more than zero NAs from the data set. But really what we want to do here is pull out rows that have fewer NAs. Depending on what you decide for your data set, you can determine how many missing values you will accept across columns. So here we're going to change the zero to two, and we're going to change our greater than sign to less than. This is going to pull out rows from the data set where there are fewer than two NAs. So it's going to pull out rows where there are zero NAs, where there's no missing data, and it's going to pull out rows where there is just one missing data point. If I changed this to three, it would pull out rows where there are zero, one or two missing data points. So this is something you can decide.
How many missing data points is still OK is up to you. We're just going to say that we only want one missing value, so one variable across the row can be missing. Let's see what happens when we run that. I'm going to make the console larger, and we're going to go up to see if row 24 is still there. Yes, it is. We still have row 24 included in the data set, with that missing value for body temperature. So now how do I create the data set with just those single missing values? Well, I can simply go to this line 30 that we just wrote. I can write life data, and I can assign to it the output of that line of code that we just ran. So here I'm selecting all of the rows where there are zero or up to one missing value, and I'm assigning all of those rows to life data. So I'm writing over the life data set with just the rows that have zero or up to one missing data point. Remember we had all those NAs in the last rows, rows 201 through 210. Let's see what happens now when we run this and then run a summary of our new life data object. What we can see again is our statistical summary of each column. And now we see, for example, that time and age and diff don't have those 10 NAs in them. So we've gotten rid of those rows that had multiple NAs. But we still have row 24, which has one NA; it's now showing up in body temp. The next thing that we want to do is look for rows that have a missing value, so we can go back to our code. Now we can say life data again, we can use that rowSums function and is.na, and we want to see rows where there are more than zero missing values. When we run that, it's going to pull out just the row where I have a missing value. Again it pulls out row 24, and the only missing value is that one for body temperature. So how do I pull out just that missing value? I'm only interested in that missing value for body temperature. That's really what I want to get at, and I want to replace it with something. I want to fix it.
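As a rough illustration of the difference between na.omit and the rowSums threshold described above, here is a sketch on made-up data. The data frame name and the columns `age`, `bodytemp` and `diff` are assumptions for the example.

```r
# Toy stand-in for the course data; names and values are assumptions.
life_data <- data.frame(
  age      = c(300, 310, 295, NA, NA),
  bodytemp = c(98.6, NA, 97.9, NA, NA),
  diff     = c(-5, 2, 0, NA, NA)
)

# na.omit() drops every row that contains any NA, including row 2,
# which is only missing a single value.
nrow(na.omit(life_data))

# Keep rows with fewer than two NAs instead, so row 2 survives
# and only the fully empty rows are removed.
life_data <- life_data[rowSums(is.na(life_data)) < 2, ]
life_data

# Rows that still contain a missing value.
life_data[rowSums(is.na(life_data)) > 0, ]
```

Changing the 2 to a 3 would also keep rows with two missing values, which is the judgment call mentioned in the video.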
So one way I can do that is again by using is.na, and I can specify life data, dollar sign, body temp. So I'm picking out every NA value in body temp, and then I am using that to subset, in square brackets, the entire column of body temp. So again, I'm specifying life data, dollar sign, body temp, which pulls out just that column for body temperature. I can then use the square brackets to subset from body temperature. So, for example, if I do life data body temp and just run that, it'll give me all 200 values of body temperature, including the NA value in row 24. That is essentially a vector of numbers, and I can subset that vector by a logical vector. is.na, remember, is testing whether there is an NA value or not. It's returning FALSE where there isn't and TRUE where there is, so it will return a TRUE value for row 24. So if I run these two together, I'm going to be subsetting that vector of body temperature by TRUEs and FALSEs, and the only place where it's TRUE is the 24th value. That's why it will only return the 24th value, which is an NA. Now, what can I do with that? Well, I can replace that value with the median of the column. So I can say that, where that NA is that I've just selected, I'm going to assign to it the median value of life data, dollar sign, body temp, and I'm going to set na.rm equal to TRUE. So here I am saying that this NA value is going to be replaced with the median value of all the body temperature measurements that we did have recorded in the data set. I set na.rm equals TRUE to remove any NA values, because if I run the median on life data body temp without it, it is going to return NA, because there's still an NA value present in there. Now we're going to replace that NA value with the median value. Let's see what happens when we run that. We've run it, and now let's run the is.na part again, the first part of this. We get numeric(0).
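A small sketch of this replacement step, using an invented five-value column in place of the real 200 body temperatures (the name `bodytemp` and the values are assumptions):

```r
# Toy body temperature column; values are illustrative assumptions.
life_data <- data.frame(bodytemp = c(98.6, 97.4, NA, 97.0, 99.1))

median(life_data$bodytemp)                 # NA, because a missing value is present
median(life_data$bodytemp, na.rm = TRUE)   # median of the recorded values only

# is.na() gives a logical vector; inside [] it selects only the missing
# entries, and the assignment overwrites just those entries.
life_data$bodytemp[is.na(life_data$bodytemp)] <-
  median(life_data$bodytemp, na.rm = TRUE)

# Nothing is left to select, so this prints numeric(0).
life_data$bodytemp[is.na(life_data$bodytemp)]
```

Median replacement is the choice made in the video; other replacement values could be substituted on the right-hand side of the assignment in the same way.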
There are no NA values, because we've replaced that one in row 24 with the median value of the column. So if I go to life data and I specify that I want the 24th row, all columns (remember, the first value is rows, then a comma, and the next value is columns; if I leave columns blank within the square brackets, that will return all of them) and I run that, it gives me row 24. And now I see that body temperature, instead of being NA, instead of missing, has been replaced by 97.4, which is the median value of all recorded body temperatures. So it replaced the missing value here with a new value. The next thing that we want to try to do is look for the outliers in body temperature. Remember, not only did we have a missing value, but we also had some that were unrealistic, just not physically possible. 118 F and close to 90 degrees F, those would be really too hot or really too cold; that person would be in serious, serious medical trouble. They would have a really big problem on their hands. So what we're going to do now is look for those outlier values, and one way we can easily access outliers is by saying boxplot.stats, and we're going to give it life data and specify body temp. So this is going to run the boxplot.stats function on just our body temp column. Let's see what happens when we run that. It gives us multiple things. It gives us stats, it gives us n, the sample size, it gives us conf, and it gives us something called out. These are outliers in the data set. We've got 91.4, 118, 90.8 and a fourth value. These are probably unrealistic measurements for a person's temperature. These are errors. There was something wrong with the thermometer, or the person using it was not appropriately trained, so we want to get rid of these. We want to replace them with more realistic values. So one way that we can do that is to come back to our boxplot.stats call.
We can specify from it, using the dollar sign, out, so we can access those out values, the four values that were found to be statistical outliers, by using the dollar sign on the output from a function. So we've used dollar signs to access columns in a data frame, but we can also use the dollar sign to access named elements of the output of a function. When we run just that line of code, we get just those four values that boxplot.stats identified as outliers. The next thing that we can do with this is assign them to a variable, outs. They are the outliers, so we can run that. So now outs will have appeared in our environment. We have outs, and it says it is a numeric vector and lists the numbers, all four of them. Now we can replace those outs values with the median of body temperature, just like we did with the missing value. What we want to do is access life data, and in particular body temp. So we're going to put life data, dollar sign, body temp, and then we're going to use square brackets to subset from just the body temperature column of life data. We're going to say where life data, dollar sign, body temp is in our outs vector. So we want to extract any value that is one of those four that were identified as outliers. We can do that in R by writing %in% outs. What this is doing is saying that any value of body temp that is within the four values in outs, that is part of that outs vector, is going to be extracted from the body temperature column. So we're subsetting here again. Let's see what happens when we run this. It says that we've got those four values, because we're pulling out the body temperature values that are in the outlier vector. So we've got the same four values that we have in outs, and now we're going to assign over those. We're going to write on top of them the median value of life data, dollar sign, body temp.
We're going to use that na.rm argument again, just to make sure that we're not being tripped up by any missing values. So let's run this. Now let's do a summary of life data once again, just to see how much we've fixed. Let's make the console larger and look at this summary. We've just been working on body temp, so now we see that the minimum for body temp is 94. That's more acceptable. We see the maximum is 100.3; that person is probably a little bit hot, but that's still within the bounds of reality. And we also don't see any missing values. So this data set is just about fixed. We don't have any missing values anymore, and we have also fixed those unrealistic values of body temperature. The next thing we want to do is fix that really unrealistic 119 year old person with 1430 months recorded as their age. What we can do to fix that is say life data, dollar sign, age. We're going to use the square brackets to subset life data, dollar sign, age. So we're extracting from life data age where life data age is equal to 1430. We have to use the double equals sign, which is a logical operator in R. If I use a single equals sign, R reads that as an assignment operator. We want to test here whether they're equal or not, so we're using the double equals sign. So we're pulling out from life data age where life data age equals 1430. If we run just that, we get 1430, because that's where 1430 equals 1430. We're going to assign on top of that the median of life data, dollar sign, age, and we're going to use the na.rm argument again, just to make sure that we don't have any missing values affecting the calculation of that median. So we're replacing that 1430 with the median age. The median age, let me run that real quick, is 307 and a half months. So we're replacing that value of 1430 with 307.5. We can do that, and let's just check the summary of life data again.
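The outlier replacement described above might look something like the following sketch. The ten-row data frame, its column names and its values are invented for illustration, so the outliers found here are not the four from the video.

```r
# Toy data; column names and values are assumptions for illustration.
life_data <- data.frame(
  bodytemp = c(98.6, 97.4, 98.0, 97.9, 98.2, 98.8, 97.7, 98.4, 118.0, 90.8),
  age      = c(300, 310, 295, 320, 305, 1430, 298, 312, 307, 301)
)

# boxplot.stats() returns a list; its $out element holds the points
# that fall beyond the whiskers.
outs <- boxplot.stats(life_data$bodytemp)$out
outs

# %in% is TRUE wherever a body temperature is one of the values in outs.
life_data$bodytemp[life_data$bodytemp %in% outs] <-
  median(life_data$bodytemp, na.rm = TRUE)

# == tests equality; a single = would be read as assignment.
life_data$age[life_data$age == 1430] <- median(life_data$age, na.rm = TRUE)

summary(life_data)
```

As in the video, the median on the right-hand side is computed before the assignment happens, so it still includes the outlying values; the median is used precisely because it is not pulled far by them.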
Let's go and look at age. Age is now saying that the minimum is 216 and the maximum is 811. So we've gotten rid of that 1430 value and replaced it with the median value of age. The last thing that we have an issue with is back in that sex column. We have males and females, 100 of each, but then we have this blank, and we have zero of them. That's where we had missing values in the sex column. So we really want to get rid of that blank, but it isn't recognized as NA, because it's a factor variable. So we have to do something a little bit different here to fix that. What we have to do is say droplevels, and this is going to look for factor variables within life data, the first argument. The other argument we give droplevels, exclude, is what you're going to drop as a level in the factor. If we specified M or F here, we would get rid of every male or every female in that factor variable. But what we want to get rid of is the blank, which now has zero entries. So we're saying exclude equal to nothing in quotes: no space, just nothing between those quotes. Now we can assign that to life data, and that will remove that empty factor level from our sex column. Now we can run summary of life data one final time, and we can see that the sex column just has M's for males, F's for females, and no empty values or empty factor levels. So we've entirely cleaned this data set. We've gotten rid of missing values. We've replaced missing values with more realistic values, namely median values. We've also identified outliers in the data set, erroneously high and physically impossible values, and we've replaced those with median values. And we've also removed an empty factor level. So join me in the next video, where we're going to start looking at the variables in this data set, making some plots and performing some very basic analysis. So I will catch you in the next one.
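For the empty factor level, a minimal sketch on made-up data is shown below. The `sex` column and its values are assumptions; the blank row is removed first to mimic the state of the data at this point in the video.

```r
# Toy factor with a blank level left behind after its row was removed.
life_data <- data.frame(sex = factor(c("M", "F", "", "F", "M")))
life_data <- life_data[life_data$sex != "", , drop = FALSE]

summary(life_data$sex)   # the "" level is still listed, with a count of zero

# exclude = "" names the level that droplevels() should discard.
life_data <- droplevels(life_data, exclude = "")
levels(life_data$sex)
```

Because the blank level is unused here, calling droplevels(life_data) with no exclude argument should also remove it; exclude is the form used in the video.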
10. Exploring a dataset (histograms, box plots, descriptive statistics): Hello and welcome back. In this video, we're going to make some basic plots and run some sample statistics on the continuous variables in this data set. You'll remember that there are continuous variables, like age and the difference measure, and then there are categorical variables, like sex and education. But first of all, we're going to explore the continuous variables. The first thing that I've done is create a box with a comment inside it, saying that this is where we're exploring the continuous variables. That just visually separates this section of code from the previous section, where we were fixing the data set. And the first thing I'm going to do is change the graphics device. The last time we made a box plot, like this one in the bottom right, you saw that it took up the entire plotting window, and remember that we can export it from here using Save as Image or Save as PDF. But let's say I want to create two plots, and I want them to be alongside each other. I want to see both of them at once. What I can do is use the par function, and I can specify the argument mfrow, and I'm going to say that equals c, one, comma, two. So I'm concatenating together here one and two. The one specifies that I want one row, and the two specifies that I want two columns. Essentially this is going to divide our plotting window down the middle, and it's going to allow me to put one plot on the left side first, and then the second plot will go on the right side. So we run that. We don't see that it's done anything, but it has divided up that plotting window for us. The next thing I want to do is create a histogram of the dependent variable, which is the difference measure.
If you remember the introduction to this data set, you'll remember that the difference measure is negative when the participant thinks that they are younger than they actually are, and it's positive when the participant thinks they are older than they actually are. So let's make a histogram using the hist function, and we give it life data, dollar sign, diff, for the difference measure, our dependent variable. We're going to label the x-axis of this histogram using xlab. We're going to label the x-axis with the text difference measure, and that has to be in quotes because it's human language, so R will not try to read it as code. We're going to say the main title of this plot, the main argument, is equal to NA. I'll explain why that is later on. We're going to give the histogram bars a color, col equals sky blue. In R you can specify colors as text strings. Let me open this document that shows you colors and how R recognizes them as text strings. You can specify some very specific colors using very specific names, and R will recognize them. I encourage you to play around with this and try different names of R colors. You can find a description of these online. This is a PDF with the different colors and their names. I'm just going to use sky blue because it's a nice color. Let's run this first histogram. There we go, and the histogram has appeared on the left hand side of this plot, as opposed to being centered and taking up the entire plotting window. The next thing I'd like to do is add a representation of density along the x-axis of this histogram. To do that, I'm going to use the rug function. I'm going to supply life data, dollar sign, diff. See what happens when we run that; keep watching the histogram. You'll see that all of these little dashes have appeared along the x-axis.
So it's just another visual indication of where the data lie in our data set, in addition to the bars themselves on the histogram. To tidy up this plot a little bit, the last thing I'd like to do is create a bounding box around it using the box function. It's just box, open and closed parentheses, and that creates a nice bounding box around my plot. It tidies it up a little bit and makes it look a little more professional and polished. The next thing I'm going to do is create a box plot that will appear on the right hand side. So I'm going to say boxplot, I'm going to give it life data, dollar sign, diff, and then I'm going to specify a y label for the y-axis of the box plot as difference measure. Then I'm going to say again main equals NA, since I'm not giving a main title yet, and I'm also going to say col equal to sky blue, to keep it consistent with the previous plot. Let's add the box plot. There you go. We get a nice box plot in the second plotting position, the second column of this plotting window. Now I'm going to restore the graphics device back to a single plotting window, and to do that I use the par function that we used above, but I'm going to remove the two and put a one there. So I'm back to one row, one column, or one single plotting window. You'll see that this didn't do anything to our plots. It does not change what's already plotted; it will only act on what we've yet to plot. Now I can add a title to the plot. I'll write a comment to myself here that I'm adding a title. I can do that using the title function, and I can specify main equals difference measure. Let's see what happens when I do this. Now I get a main title of difference measure centered above both plots. If I had not specified main equals NA back here in each of these plots, if I'd given them a title, then they would have had individual titles centered above each plot.
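Put together, the plotting steps above could be sketched like this. The difference measure is simulated with rnorm() because the real column is not available here, and the names `life_data` and `diff` are assumptions.

```r
set.seed(1)
# Simulated stand-in for the difference measure; values are assumptions.
life_data <- data.frame(diff = rnorm(200, mean = 0, sd = 10))

par(mfrow = c(1, 2))   # one row, two columns

hist(life_data$diff, xlab = "Difference measure", main = NA, col = "skyblue")
rug(life_data$diff)    # one tick along the x-axis per observation
box()                  # bounding box around the histogram

boxplot(life_data$diff, ylab = "Difference measure", main = NA, col = "skyblue")

par(mfrow = c(1, 1))   # back to a single plotting window
title(main = "Difference measure")   # one title centered over both panels
```

Leaving main = NA on the individual plots is what keeps the space free for the single shared title added at the end.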
So using par in this way not only allows us to put separate plots on the same page, it also allows us to add overarching titles above multiple plots, if we set the graphics device back to a single plotting window instead of one row, two columns. That's a neat way to make your plots that much more professional looking. The next thing I'd like to do is create a new data frame, and in it we're going to store summary statistics. I'm going to call this new data frame diff, and I'm going to say that diff is a data.frame, and within that function I can start to create a data frame and specify different columns of data. The first column I'm going to label n, for sample size, and I'm going to say it equals the length of life data, dollar sign, diff. Let's see what happens when we just run length of life data, dollar sign, diff. That returns 200. length is a function that calculates how long a vector is, so here it counts how many rows the column has. This is saying that life data, dollar sign, diff is 200 participants long, which is as we know it to be. We can then create another column called min, and we're going to say min equals the min function applied to life data, dollar sign, diff. When I run the min function by itself, I get the minimum value in the data set. I'm going to assign that to min. I'm going to calculate a number of these summary statistics, so I'm just going to copy and paste in something that I prepared earlier. You'll see that we've got length, we've got min, and we also have max, mean, median, variance and standard deviation. So I'm using the built-in functionality within R to calculate each one of these statistical measures, and I'm assigning it to a variable name within this new data frame that we're creating. Let's see what happens when I run this. It creates a new data frame, so you'll see in our environment up here we have a new data frame called diff, with one observation of seven variables. So if we look at diff, we see that we have n, 200.
We have a minimum, a maximum, a mean, a median, a variance and a standard deviation. So we've created essentially a seven column, one row data set of these different statistical values. In the next video, I'm going to test your newfound coding skills by asking you to create histograms and box plots, as well as statistical summaries, for a different one of the variables in this data set. So I will see you then.
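A sketch of the one-row summary data frame follows. The input values are simulated, and the object is named `diff_stats` here purely as a hypothetical choice so that it does not mask R's built-in diff() function; the column names are also assumptions.

```r
set.seed(1)
# Simulated stand-in for the difference measure; values are assumptions.
life_data <- data.frame(diff = rnorm(200))

# Each argument to data.frame() becomes one column of the result.
diff_stats <- data.frame(
  n      = length(life_data$diff),   # sample size
  min    = min(life_data$diff),
  max    = max(life_data$diff),
  mean   = mean(life_data$diff),
  median = median(life_data$diff),
  var    = var(life_data$diff),      # variance
  sd     = sd(life_data$diff)        # standard deviation
)

diff_stats   # one observation of seven variables
```

For the age challenge that follows, the same block should work with `life_data$age` substituted throughout.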
11. Challenge #1 - Create plots and statistical summary of a variable: Hello. In this video, I am setting you your first challenge. I want to test your coding skills just a little bit, and I want you to create a histogram, a box plot and a statistical summary for the age independent variable in our life data data frame. To give you a hint, this should be really similar to the code above that we wrote for the difference variable. Essentially, I want you to do exactly the same thing, but for the age variable. So go ahead and pause this video now. Try to work on that, and if you get stuck, come back to this video, because I am now going to show you the solution to this challenge. Good luck. So now you're back in this video, and I want to show you how to go about doing this. The easiest way to do this is actually to go back into our code, and we're going to go ahead and copy everything from line 47 down through line 66. We're going to paste it underneath our challenge comment, and essentially everything here is very similar, except we have to change the difference variable to the age variable. The first place I see the difference variable here is in the histogram. I want to change life data, dollar sign, diff to life data, dollar sign, age. I can then change my text here. I'm no longer going to be plotting the difference measure. Instead, I'm plotting age, and I can add something to make it a bit more descriptive by saying in months, to avoid confusion. I'm going to leave the rest of this the same. But since I'm plotting something different here, let's change the color. Let's say it's going to be salmon; that's the color name in R. And we're also going to change the rug underneath the plot, the density, to age instead of diff. Then we're going to do the same down here in the box plot. We have to change diff to age. We're going to change the label of the y-axis to age in months, so I'm going to copy that from above.
I'm going to change the color also to salmon, and then I'm going to give these a title of participant age instead of difference measure. This is the overarching title for both plots, so I'm going to say participant age. The next thing I'm going to do is change this diff data frame to an age data frame and replace the summary statistics of the difference measure with those of the age variable. So I'm just going to change diff to age, and then I'm going to go through and change everywhere the diff data is referenced to age. I'm just going to go ahead and copy this and change each one of these. You have to be careful when you're doing this not to paste in the wrong place or to remove any of the parentheses. So I'm almost done replacing every one of these diffs with the age variable. Let's go ahead and run this, and you'll see that now we have age in our environment, and we can look at what it is: n, min, max, mean, median, variance and standard deviation. So we have a statistical value for each one of these measures, one observation of seven variables. Essentially it's seven columns of one row across each of these variable names that we created within it. So there you go. That's the code to complete this challenge. I hope that you managed it without looking at this solution, but if you didn't, don't worry about it. This is still very early days. Let's continue with writing more exciting code, trying to improve your understanding of R, and learning really cool things. So I will catch you in the next one.
12. Exporting plots and statistical summaries: Hello. In this video, I would like to show you how to set up output storage for statistical summaries, for anything printed to the console, and also for graphical output like plots. To do this, we're going to create a PDF and a text file to which we are sinking, or outputting, everything we're doing in the R environment. So let me create a comment here at the beginning of the continuous variable code that we wrote in the previous videos, and I'm just saying here that we're going to set up figures outputting to a PDF and tabular or statistical output to a text file. I'm going to be using file paths and directories here, so you have to make sure that you change these to what is appropriate on your machine, because we do not share the same file paths. The first function that we're going to use is pdf, and the argument that I'm gonna use is file, and I'm going to specify a file path. Now, on a Mac, which is what I'm using, I can go to my Finder window and I can go to my output folder of Section one in my Udemy class folder. On a Mac, when I right click on output, I get a copy option to copy that folder, but if I hold down option you will see I get a different copy option. I get to copy output as path name, as long as I'm holding down the option button on the keyboard. So that's what I want to do. I want to left click while holding down the option button and copy output as its path name. You'll see when I paste this back into the R script, I get my file path with the slashes facing forward. The next thing I have to do is put quotation marks around that, and then finally, at the end of the output folder, I have to have another forward slash and create my file name, which I'm going to call con vars dot pdf, because in this section of the code we're looking at continuous variables, and it has to be a dot pdf.
The next thing that I'm going to do is create a sink output for text and for things printed to the console. So I'm going to say sink function, and I'm going to specify the same file path. I'm just gonna copy down everything within the quotes. I don't have to say file equals in the sink function. I just specify the file path. Now I have to change the file that it's going to, to con vars dot txt, because the sink function will output everything to a text file, not a PDF. Now, if you're on Windows, this will be a little bit different for you, so I'm going to leave a comment to help you figure out how to get your path in here. If you're using Windows, you have to either copy the file path from the Windows Explorer address bar, or you have to hold down the shift key and right click on the folder. So hold down the shift key and right click on the folder. This is only applicable on Windows machines. When you do that, you will get something like this. If my computer was a Windows machine, I would get something like C drive, colon, and then I would get backslashes through my folders, through my directories, to that output folder in section one of the Udemy folder. Again, that will be different depending on where you created these folders on your own computer. But there's an issue with this, because you'll notice these slashes are backslashes, and R is only going to recognize forward slashes. So you have to go through and manually change these to forward slashes, not backslashes, or you can do a double backslash and R will recognize that. Either way, you have to do one of those things. A possible workaround for this, if you're on Windows and you don't want to manually change all of these slashes, is to assign something to a variable called path. We're gonna use readClipboard. This is a function in R, if you're using Windows only, that will read what you have copied to the clipboard.
So you've still gone to your Windows Explorer, to the address bar, and copied the address to something that looks like this, which will come in with the backslashes. But when you run this readClipboard function, it will copy from that clipboard, and it will automatically convert it into a form R is capable of reading. So within path, you will actually have stored something that looks like that. It has converted it into the form R can read. That's just one way of working around it on Windows. I'll let you play with that and decide for yourself which you prefer working with. Now, the next thing that I want to add to the code is printing out certain statistical measures. Remember, previously we created this diff data frame that had the statistical summaries, like sample size, min, max, mean, median, etcetera. So we're gonna add, after those lines of code, a comment that says we're going to print that data frame rounded to three digits. I'm going to use the function print, and I'm going to put in double quotes diff. This is just gonna print diff as human language so that I know what it is. Then I'm going to say print again. I'm gonna print the diff data frame that we created up above, and I'm going to do comma, digits equals three. That will print that data frame and the data within it to three significant digits, so that I have it nice and trimmed down. The next thing that we want to do is come down and do the same for the age variable that we completed as part of this section's challenge. So after we create the age data frame, we're going to do the same. We're gonna print age in quotes, and we're gonna print the age data frame rounded to three digits. And the very last things that we have to do: we have to stop sinking. We just say sink, open and close parentheses.
We also have to turn off the graphics device using dev.off, open and close parentheses. When we run both of those functions, it will stop outputting to the text file, it will stop outputting to the PDF, and it will save those off so that we can open them. Until we run sink and dev.off, it will not close those files, and we won't be able to go and look at them. Then the last thing that we're going to do is remove age and remove diff using the rm function. This removes both of those variables from our environment so that we don't get confused with them later on. Now that we have output them to a text file, we no longer really need those variables in our environment. So let's go ahead and run this entire section of code one more time. I'm just gonna comment out these Windows-specific things that I showed you. We're going to run from line 47 all the way down to the end here, and you'll see that in the environment we now only have our life data frame that we started with, and we also have that outs value from fixing the data set. We've removed age and we've removed diff. You'll also notice that nothing was plotted here. That's because when you have a PDF device running, you are sending the plots directly to that document. They will not appear in the plot window in RStudio. So let's go to our output folder, the folder that we sent these things to, and we see that we have a con vars dot pdf and we have a con vars dot txt. Let's look at the con vars dot txt first, and you'll see that we have diff and then our sample statistics of mean, median, min and max, etcetera, for the difference measure. We also have age printed, and then the summary statistics for it. So this is a great way of storing statistical output so that you can always come back and open it. It's not getting lost in the console that you've printed it to. You will have it there so you can come back and do further analysis on it later.
If we look in the con vars dot pdf, what do we see? Well, we see our difference measure, the histogram and the box plot that we created previously. And we also have the participant age histogram and box plot that we completed previously. This is in a two-page PDF document that we can open, or that we can export wherever we like and use in publications or further analyses. So there you have it. That is how you output graphical output and statistical output to folders on your machine. So join me in the next video, where we're going to explore the categorical variables in the data set. I will see you then.
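The export workflow from this video can be sketched as below. This is a minimal example under stated assumptions: the output folder is a temporary directory rather than the instructor's own path, the file names are spelled `convars.pdf` and `convars.txt` as a guess at what was spoken, and the summary values are made up. The last lines show one way to turn a Windows-style path into forward slashes without readClipboard; the path there is hypothetical.

```r
# Assumed output folder; replace with a path that exists on your machine.
out_dir <- tempdir()

# Open a PDF graphics device and start sinking console output to a text file.
pdf(file = file.path(out_dir, "convars.pdf"))
sink(file.path(out_dir, "convars.txt"))

# Made-up values standing in for the difference measure.
diff_values  <- c(-4.2, -1.5, 0.3, 2.8, -3.1)
diff_summary <- data.frame(n = length(diff_values), mean = mean(diff_values))

print("diff")                      # plain label so the text file is readable
print(diff_summary, digits = 3)    # summary shown to three significant digits
hist(diff_values, main = "Difference measure")  # goes to the PDF, not the plot pane

# Stop sinking and close the PDF so both files are written and can be opened.
sink()
invisible(dev.off())

# Remove the summary object once it has been exported.
rm(diff_summary)

# Hypothetical Windows path with backslashes, converted to forward slashes.
win_path <- "C:\\Users\\me\\output"
fwd_path <- gsub("\\", "/", win_path, fixed = TRUE)
```

Nothing appears in the plot window while the PDF device is open, which matches the behaviour described in the video.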
13. Exploring categorical variables (bar plots, proportions, multi-part plots): Hello and welcome back. In this video, we're going to explore the categorical variables within the data set and run some very basic sample statistics. Because they're categorical variables, we're basically just gonna look at the proportion that are in the different levels within each categorical variable. The first thing that I would like to do is actually look at a table. We're gonna perform counts of each factor variable at each level within it. So, for example, I can run the table function on life data dollar sign sex. Now, you may remember that the sex variable is M for male and F for female, and that we have 100 of each. I can briefly check that again by using the str, or structure, function. I'm just going to run this in the console, and I can run that on life data. When I do that, you'll see we get the summary just like we did in the drop down menu in the environment, and we see that sex is indeed a factor variable with two levels, F and M. So let's look more in depth at the sex variable by using the table function. When we run table, we see that we have 100 Fs for female and 100 Ms for males. We can run the table function on other categorical variables. For example, we can say table, life data dollar sign marriage. This is the marital status of the participants. What we see here is that it's all zeros and ones. We have 132 zeros, people that are not married, and 68 ones, people that are married. You can go ahead and play around with the table function on all of the other categorical variables. But the next thing we're going to do is set up a graphics device that allows us to open multiple plots in the same window, just like we did previously. We're gonna set up a graphics device here where we have two rows and three columns in our output window.
So to do that again, we use par and the mfrow argument. We're going to specify the concatenate function, and this time we're gonna put 2 comma 3, signifying two rows by three columns, and we're going to run that. That gets our graphics device primed for having six plots across it. We're gonna have a row of three in the upper portion of the plot, and then we can have another row of three below that. The first plots that we're going to do are bar plots. We're going to create bar plots of the proportion of subjects in each level of each categorical variable. To do this, we're going to say barplot; we're using the barplot function. Instead of the first argument being the data that we want to make into a bar plot, we're actually going to specify the table function, and we're going to say table, life data dollar sign sex. We know from above that that returns 100 males and 100 females. But now what we're gonna do is divide that by the length of life data dollar sign sex. Remember, the length function returns the length of a column or a set of data. So here we're having it calculate the length of sex, which is 200 participants, and we are dividing the table of 100 females and 100 males by that length of 200. So both the 100 females and the 100 males will be divided by 200. Then we're going to specify color. I'm just going to use sky blue again, but feel free to play around with those colors however you fancy. I'm also going to specify ylim. ylim is an argument that restricts the y-axis to a set span or set limits. Because we're calculating proportions here, I want the y-axis to be limited between zero, the origin, and one, because if R labeled the y-axis 1.1 or 1.2, then it wouldn't make any sense, because you can't have a proportion that is greater than one. I want to make sure that R is behaving itself there.
The next thing I want to add into this function is an x label for sex, and I want to have a y label for proportion. Then I'm going to say that the main title of this plot is sex of participants. I'm then going to specify that the names used to name the bars in this plot are going to be female and male, instead of just using F and M. There we go, let's run that. You can see that we get a small bar plot in the first cell of our two by three matrix. We're going to tidy that up a little bit by adding a box boundary, just like we did before. It makes it a little bit cleaner. Now you can go ahead and do the same thing with other categorical variables, so I'm just going to copy down the code, the same as above, but I'm going to change sex to the marriage categorical variable. Then I'm going to change the x label to married and leave proportion as it is, and then I'm going to say that the main title of the plot is marital status, and I'm going to change the names of the bars to no, not married, and yes, married. Let's try running that. There you go. We're getting nice bar plots. They're showing the proportion of participants in each level of the categorical variable, so whether we have males or females, or whether we have married yes or no. You can continue to do this for the other categorical variables, and they will continue to populate in that plotting window. Remember that when you finish plotting, you should always restore the graphics device back to a single plot configuration, just so you don't get any sudden surprises when you come to plot your next item. And so I will catch you in the next video, where we're actually going to begin to get inferential with our statistics and begin to ask more specific questions using statistical tests like t-tests and ANOVAs. We're going to start actually querying this data set. So I will see you in the next video.
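The counting and plotting steps from this video might look like the sketch below. The data frame name `lifedata` is an assumption, and the data are built by hand to reproduce only the counts mentioned in the video (100 of each sex, 132 not married, 68 married); everything else about the real data set is unknown.

```r
# Hand-built stand-in that matches the counts quoted in the video.
lifedata <- data.frame(
  sex      = factor(rep(c("F", "M"), each = 100)),
  marriage = factor(c(rep(0, 132), rep(1, 68)))
)

# Counts per level, then proportions (counts divided by number of participants).
table(lifedata$sex)
sex_prop      <- table(lifedata$sex) / length(lifedata$sex)
marriage_prop <- table(lifedata$marriage) / length(lifedata$marriage)

# Prime the graphics device for a 2 x 3 grid of plots.
op <- par(mfrow = c(2, 3))

barplot(sex_prop, col = "skyblue", ylim = c(0, 1),
        xlab = "Sex", ylab = "Proportion",
        main = "Sex of participants", names.arg = c("Female", "Male"))
box()  # draw a boundary around the plot

barplot(marriage_prop, col = "skyblue", ylim = c(0, 1),
        xlab = "Married", ylab = "Proportion",
        main = "Marital status", names.arg = c("No", "Yes"))
box()

# Restore the single-plot configuration when finished.
par(op)
```

Saving the old settings in `op` and restoring them with `par(op)` is one common way to return to a single-plot layout; calling `par(mfrow = c(1, 1))` would work as well.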
14. Working with data types (vectors, factors, data frames): Hello. In this video, we're going to be preparing for additional statistical analysis that will allow us to answer inferential questions about our data set. I've created a new comment box here that's going to separate this part of the code and explain what it is we're doing. First of all, we're going to analyze categorical variables. Before we do this, it's very beneficial if we first of all attach the variables from life data. So the column names in life data are what we're gonna attach to our environment. Let's see what that is. We're gonna use the attach function and just supply it with life data. So we're supplying attach with life data, which is our data frame that has column names within it. When we run that, you'll see that it has attached the following variable names. So now we can actually use those variable names directly, without having to reference life data in our environment. For example, to get a summary we used to have to do life data dollar sign age, and that would give us a statistical summary of the age variable within life data. But now that we've attached it, we can get rid of the dollar sign, we can also get rid of life data, and simply say age, and R will recognize what we're talking about. So we can do that and get a statistical summary of age. Now that we have attached these variables, we also want to tidy up our factor or categorical variables a little bit and give the levels within their categorical structure names that are more descriptive. For example, in sex, all we have at the moment is M for male or F for female. We really want to actually call them male and female. So what we're gonna do is say sex, which we can now refer to directly because we have attached it to our environment, and we can say factor function, sex, and we can say labels equal to c, parentheses, female and male.
What we're doing within this factor function is saying: take the sex variable and change its labels, change the levels within the categorical variable of sex, to female and male. In other words, transform F to female and M to male, and overwrite the sex variable. So we are assigning on top of the sex variable. Let's see what happens when we run that. Now, if I do a summary of sex down here in the console, I see that instead of F and M I get female and male. I have gone ahead and prepared code to transform our other categorical variables into more descriptive labels, so here's something I prepared previously. We have marriage: we're changing the labels from zeros and ones to not married and married. We're changing the job variable from zeros and ones to unemployed and employed. The family variable just has to do with whether the participant feels they had a lot of family surrounding them to support them, so whether they have no family or they have family. And then the education level of the participant, whether it's HS for high school, UGrad for undergrad, or Grad for graduate degree. You'll also notice here that I'm using an equals sign to reassign these variables. In R you can use an equals sign or the arrow operator that we've been using previously to make assignments, but I prefer to use the arrow operator. Equals, to me, would be too easy to confuse with a mathematical operator, so I prefer to use the arrow sign, but you can use equals if you wish. So let's go ahead and run these other variables. We're changing them to something much more descriptive. You can see here that they're appearing in our environment. The next thing that we want to do is create data frames for the dependent variable and for the categorical independent variables. The dependent variable is what we're interested in. It is the response variable, the difference measure.
We want to determine or predict what the difference measure is in participants based on all of the other variables that we collected. We're also going to use categorical independent variables, which are the predictor variables that we're trying to predict the dependent variable from, and that's what we're going to look at here. But first of all, we want to create a data frame for the dependent variable. We're going to say that the dependent variable is coming from diff. Again, we've attached this, so you can refer to it without the life data frame. We're also going to say that we're making a life data dot cat data frame, cat for categorical variables, and we're going to say data.frame and give it sex, marriage, job, family and education. So we're making a new data frame here that has the sex, marriage, job, family and education variables within it. Let's go ahead and run both of those. You can see that they are also added to our environment. Then the last thing we want to do is create a vector of variable names. To do that, we're going to say vars, for variables, dot cat, for categorical, and we're going to supply it with the column names. Remember, we used that function before. We're going to supply the column names of life data dot cat. So that has taken the column names from our new life data dot cat data frame and has assigned them to the variable vars dot cat. If we run just vars dot cat, we see that we have sex, marriage, job, family and education. When we come back in the next video, we're going to continue our statistical analysis of these categorical variables, and we're actually going to do that using our own function that we're going to create. Instead of using existing functions, we're going to create our own, so it's very exciting, and I will look forward to seeing you in the next video.
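The relabelling and data-frame steps can be sketched as follows. The names `lifedata`, `lifedata.cat` and `vars.cat` are guesses at the spelling of what is spoken in the video, the four-row data set is invented, and only two of the five categorical variables are included. This sketch also uses the dollar sign rather than attach, which is a deliberate simplification and not what the video does.

```r
# Invented four-row stand-in for the course data (assumed names and values).
lifedata <- data.frame(
  sex      = factor(c("F", "M", "M", "F")),
  marriage = factor(c(0, 1, 0, 1)),
  diff     = c(-2.5, 1.0, -4.0, 0.5)
)

# Give the factor levels more descriptive labels.
# Labels are applied in level order: F, M and 0, 1.
sex      <- factor(lifedata$sex,      labels = c("Female", "Male"))
marriage <- factor(lifedata$marriage, labels = c("Not married", "Married"))

# Data frame of the categorical independent variables.
lifedata.cat <- data.frame(sex, marriage)

# Character vector of the categorical variable names.
vars.cat <- colnames(lifedata.cat)
vars.cat
```

Because `labels` follows the existing level order, it is worth checking `levels()` on the original factor first so that the new names line up with the intended categories.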
15. Creating a custom function for statistical analysis (t-test, ANOVA, plotting): Hello and welcome. In this video, we will be defining our own function. The purpose of the function is to look for relationships between the categorical variables in our data set, so things like education, sex, family, employment and marital status, and whether any of those variables have an impact on the difference measure, so how old or young someone thinks they are. Really, what we want to do is analyze differences between the levels within these categorical variables. For example, we want to see whether there's a difference between males and females in how they perceive how old they are, or whether there's a difference between married and unmarried participants in how they perceive how old they are. But before we actually write out this function, what I want to do is show you what the output is going to look like and what we're striving to create. I've made the console a bit larger so that you can see the output that we will be getting. You'll notice here that it says dollar sign female. The first thing that this function we're going to create will do is create a statistical summary of the difference measure for all of the female participants. So we'll be able to see what the median and mean, etcetera, for female participants are, and we'll also be able to see what those summary statistics are for the male participants. If we look at this real quick, we see that the mean for female participants is negative 0.57 and the mean for male participants is a little bit lower at negative 3.437. So just looking at these means, I would possibly imagine that males tend to think they are younger than they actually are, because we have a more negative value on average in the male participants. The other thing that we're doing here is creating a plot. If we look in our plotting window in the bottom right, we'll see that we've got a box plot.
This has split up female participants and male participants. We have labeled the x-axis each var, and we've labeled the y-axis difference measure, which is what we're trying to figure out. You'll see here that these two box and whisker plots look pretty similar. The bold line, which marks the median, is a little bit lower for males than for females, but they don't look all that different. So, really, what we want to do is see whether the difference in means that we're observing in the box plot and in the statistical summaries is statistically significant. One way that we can do that is to use something called a t-test. A t-test looks at differences between means in two subpopulations, so it tells us whether the mean of the difference measure in the female group is really statistically different from that in the male group, and whether, in fact, the females and males belong in different populations: is there really a difference between males and females when it comes to how old they think they are? When we run the t-test, we get this statistical output. What we're most interested in is the p-value, that number here, the 0.2453. The p-value tells us about significance. Because this is an inferential statistical test, we have a null hypothesis and we have an alternative hypothesis. The null hypothesis is that females and males are no different in how they perceive how old they are, so there's no difference in the mean between females and males. The alternative hypothesis is that those mean values of the difference measure are actually different, that there is a difference between males and females. If the p-value goes below 0.05, we're going to reject the null hypothesis that there isn't a difference, and we're going to say that there is statistical evidence that there is a difference in reality between males and females in how they perceive how old they are.
Because that p-value is 0.2453, it's not less than 0.05. We cannot reject the null hypothesis, and we can't say that there's really any difference between males and females here. The other thing that this function does is add the p-value to the plot. Over here, you'll see that we've got P value 0.2453. And the last thing that the function does is label the box plot with a title. This is sex, because this is the difference between males and females. So now that we've seen what output we're going for, let's look at how we write the function itself. I'm gonna go ahead and minimize the console, and this is the function that I've already written. I've written some comments starting on line 175, the function is initially named on line 180, and it runs until 222. I know this is quite a lot of code, many lines here, but we're going to go through it one line at a time and really explain what it's actually doing. So I encourage you, first of all, to listen to the explanation of the lines of code that go into this function before trying to actually copy down the code. I would leave that until afterwards, or pause the video and do that. We'll start off with the comments that I've left us here. I say that this is a function for analysis of categorical independent variables against the continuous dependent variable. Again, our dependent variable is our difference measure, which is dependent on all of the other variables that were sampled. That's what we're trying to predict. The categorical independent variables are the variables like education, family, sex and so on that are categorical, meaning they're not continuous. They are zeros and ones, male, female; they're not a continuous measure. Functions can come as part of packages you install in R, or you can create your own. So what we're doing here is defining our own function, and we're naming the function con cat fun. That's a name that I came up with. You could call it whatever you want.
I've called it that because we are looking at a continuous variable, the difference measure, and we are looking at the influence that categorical variables have on it, and this is a function. So I've called it con cat fun. This function will take three arguments. We know that within parentheses is where we put arguments for functions. We're gonna have three of them in this function. This is where we're defining the function and all of these lines of code, but once we have defined it, we can then call it any time we want without having to write out all of these lines again. We just call it, we give it the three arguments, and it runs in the background without us having to write all of these lines of code out again and again. That's the real power of functions: they allow you to use a complex analysis task over and over without having to write the code again and again and again. So line 180 is where we name the function. We're gonna call it con cat fun. It's just like we're defining a variable. We're using the assignment operator, the arrow, the chevron with the dash, and we're saying that con cat fun is going to be assigned a function. So that's the function function, in blue there, and we're saying parentheses and naming the three arguments that it requires. I've called the first argument each var, meaning each categorical variable. We're sending one of those at a time to this function, so I've just called it each var. We've then got the second argument; it's called name. That's the name of the categorical variable. And then we've got response. That's the dependent variable, or the difference measure. So that's what we give to this function. The first thing that we do within the function is we have a curly brace here, and that really defines where the actual functionality begins. We've named the function and we've given it the three arguments, but this is really the meat of the function.
This is what it does. So within that curly brace I've hit return and written a comment to myself saying that the first thing we're going to do is print summary statistics of the dependent variable by each independent variable and the levels within it. This is where we print out the summary statistics that we looked at previously in the console, so the mean, the median, etcetera, for females versus males. We're doing it for each level within the categorical variable. So for education, we'd return a statistical summary for those participants that only had a high school education, those that have an undergraduate education, and those that have a graduate education. To do this, we use the tapply function. You can look that up in the help. Essentially, tapply is going to apply a function across groups within a data set. We're giving tapply the response variable of the difference measure, and we're saying that we want to apply a summary function across all of the difference measure values that correspond to each level within the categorical variable. So this is where we're saying, OK, do a summary across, first of all, all the females within the sex categorical variable, and then do a summary across all the males within the sex categorical variable. We're then going to print that to the console. That's how we get the output that we looked at down here of female min, first quartile, median, mean, etcetera. The next thing that we're going to do is a box plot. We're going to create a box plot of the dependent variable subdivided by each independent variable and the levels within it. We use the boxplot function that we've seen before. We specify that the box plot is response, and then we use the tilde operator to say we want response against each variable that we're giving to this function. So first of all, we give it the sex variable. We want to plot response, the difference measure, versus each level in the sex variable.
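The first two steps of the function, the grouped summary and the grouped box plot, can be tried on their own. The sketch below uses six invented values; the argument names `response` and `eachvar` follow the spoken names in the video, but their exact spelling in the instructor's script is an assumption.

```r
# Six invented difference-measure values and a matching two-level factor.
response <- c(-3, 1, -5, 2, 0, -4)
eachvar  <- factor(c("Male", "Female", "Male", "Female", "Female", "Male"))

# Apply summary() to the response values within each level of the factor.
group_summary <- tapply(response, eachvar, summary)
print(group_summary)

# Box plot of the response split by level, using the formula (tilde) interface.
boxplot(response ~ eachvar, main = "sex", ylab = "Difference measure")
```

The printed result has one block per level, headed by the level name, which is the `$Female` and `$Male` layout described earlier in the video.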
We're going to say that main, the title of the plot, is equal to name. Name is one of the arguments of the function, so we will be supplying it with the name of the first factor variable, which is sex. That's where it gets the sex title from over here in the box plot. We're also supplying it with ylab, for y label, and we're saying that's equal to difference measure. So that's gonna label our y-axis. The next thing that we need to do is have a branching statement in our program flow here, because there is one categorical variable that doesn't have two levels within it. It has three: the education variable. We have high school, undergrad and grad. If we look at our environment again, we see that education has three, family has two, job has two, marriage has two levels and sex has two levels. So there's a possibility that we will have a categorical variable come into this function that has three levels within it: high school, undergrad and grad. That's an issue, because the t-test for seeing if there's really a statistical difference between the levels in the factor variable, like we did for male versus female, can only be used when there are two levels within the factor. If we have more than two, like we do in education, then we have to apply an analysis of variance test instead. To account for this possibility of a three-level factor coming to this function, we have to use if statements to branch the code. So if a two-level factor like sex, male and female, comes to this function, we go down one branch. If a three-level factor like education, where there's high school, undergrad and grad, comes to this function, it will follow a different branch and do a different analysis. That's why this function is really smart. So we're going to do an if statement, and this if statement is assessing whether the number of levels in each var is equal to two or not. So how are we doing that?
Well, we're giving it each var, which is, let's say, our sex variable, and then the levels function here, the interior function, is going to first of all tell us what levels are in each var. So for sex, it's going to say female and male. And then we're going to calculate the length of those levels, so it's going to count one, two. Then we're going to assess with the double equals sign, that logical operator, whether this number is equal to two. So, yes, with sex there will be male and female. There are going to be two levels. The length is going to equal two, and therefore two equals two. So this if statement would be true, and therefore we would do what comes afterwards, following the curly brace. That's how you set up an if statement: you say if, parentheses, and you put a logical statement that you want to assess. Then, if that's true, the code is going to execute what comes following the curly brace. So if it's sex, that's going to be true, and then it will perform a t-test. We perform the t-test using t.test, and we give that the response variable of the difference measure and whatever variable we've given to the function. So if we've submitted sex, we will assess sex, male and female, against the difference measure and perform a t.test function on that, and then we're going to assign the output of that function to the variable t test. We can then print t test. That prints the output of the function, and that's what you saw down here in the console. It prints this entire statistical output to the console so that we can see it. The next thing that we want to do is have an if statement within this if statement. So we're nesting an if statement here within another; bear with me. The reason we're doing this is because we may get a p-value that has many zeros, many digits after the point, and if we have a p-value that's going off into many digits beyond the decimal point, that can get kind of messy.
And it's not really the done thing, or really proper, to report a P value that has many digits after the point. Really, if we had, let's say, 20 digits after the point, so 0.000 etcetera, then in a statistical sense, in a publication, to be professional, you would want to just report the P value as less than 0.001. That's all you would say. Because once it gets to be lower than 0.001, we don't really have that much confidence in those digits anymore, and we don't want to report them in a publication. This if statement is going to assess whether the t test P value has gotten really, really small and really long past the decimal point or not. So we're going to say if, and then we supply it with t test, the variable we created above with the output of the statistical t test, and we say dollar sign p.value. The t.test function output has attributes within it, and we can access those attributes using dollar signs, just like we did in data frames, and one of those attributes from a t test is p.value. So we're saying here that if the t test's P value is less than 0.001, then we're going to create a variable called p val and we're going to assign it a text string, in double quotes there, of less than 0.001. That's what we're doing if this is true, and that's why it is placed within those curly brackets there. Now, if that's not true, we then say else. This is what happens if the P value is not less than 0.001. We skip assigning p val the text string less than 0.001, and instead we come down and we say p val is going to be assigned the t test's P value, but rounded off to four decimal places. So I'm using the round function. I'm rounding the value of the P value that we get from the t.test function, and I'm using its second argument, digits equals four. So again, this nested
if in here, all it's doing is looking at the P value from the t test that was performed above, and it's assessing: is it less than 0.001? If it is, I'm going to create a variable that has the text string less than 0.001. But if it's not, I'm going to round the P value off to four digits and assign that to the variable p val. The next thing that we do is add the P value to the plot. You'll notice over here that this box plot we created has P value colon 0.2453 above it. So that's reporting it on the plot, which is nice and convenient. When someone looks at this plot later on, they see these two box plots. They see they're pretty similar. Maybe they're wondering: are these really the same, or are they different? Well, having the P value right here will tell them that. So we're going to paste together the text P value with the actual P value that came out of the statistical test. We're going to say p val lab, for P value label. We're creating a new variable here, and we're assigning to it the paste function, and paste allows us to paste text strings and numbers together. So we're pasting together that text, the human language in double quotes, P value colon, and then saying comma p val. And p val is whatever we saved up here in this if statement. If it was a really small number, it's going to be less than 0.001. If it wasn't less than 0.001, p val will actually equal the rounded P value from the statistical test. The next thing that we do is add that to the plot. We add our p val lab, the P value label, to the plot using the mtext function. That's how we managed to get P value colon 0.2453 above this plot. That P value was not less than 0.001. It was 0.2453 and probably some other digits, but we rounded it to four decimal places there.
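The P value tidying described above can be sketched as a small helper. This is an illustration under assumptions: format.p is a hypothetical function that does not appear in the course script, and the 0.001 cutoff is assumed from context. Note that round with digits equals four rounds to four decimal places.

```r
# Hypothetical helper (not from the course script): tidy a P value for display.
# The 0.001 cutoff is an assumption taken from the surrounding explanation.
format.p <- function(p) {
  if (p < 0.001) {
    "< 0.001"              # very small values are reported as a text string
  } else {
    round(p, digits = 4)   # otherwise round to four decimal places
  }
}

p.small <- format.p(9.238e-10)
p.large <- format.p(0.24531234)

# paste() joins text and numbers into one label
label.small <- paste("P value:", p.small)
label.large <- paste("P value:", p.large)
print(label.small)
print(label.large)

# mtext() writes text in the margin of the current plot (the top, by default)
plot(1:10, main = "Demo plot")
mtext(label.large)
```

Keeping the formatting in one place means the same rule can be reused for both the t test branch and the analysis of variance branch.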
Now, the rest of this function basically does the same thing, except we have the else if statement here, which is going to assess whether the levels of our variable are greater than two. This is going to catch education. Education has three levels, so education will go through this analysis at the bottom of the function that we're defining, which is essentially the same as what we did for a two-level categorical variable above. The only difference is that we're going to be performing an analysis of variance, because there are three levels. An analysis of variance essentially looks for differences in means across three or more levels in a factor variable, so that's what we have to use here. It's going to perform an analysis of variance, and it's going to do the same thing to assess whether the P value is very small. If it's not, it's going to round it off, and then it's going to add that to the plot. That's what will happen when the education variable is supplied to this function. Again, I encourage you to watch this video again and listen to the explanation of this function. I know that this has been the most complicated thing that we've done so far, and it's not easy when you start to define a function, especially when the function is relatively long and doing a lot of different things. We have if statements nested within other if statements, and we have some statistical tests thrown in here too. Again, read the comments here and go watch the video explanation again. But basically, this is the function we've defined. Within the function we create a statistical summary, we create a box plot, and then we have a branching statement that allows us to cope with whether we have two factor levels or three factor levels, and that allows us to branch the code within this function. So do look at this some more.
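Here is a rough sketch of the two branches side by side, assuming simulated data and hypothetical variable names (education, difference, test, p). Pulling the P value out of the aov summary table is one common idiom; the course script may extract it differently.

```r
# Simulated stand-in data with a three-level factor (assumption)
set.seed(2)
education <- factor(rep(c("high school", "undergrad", "grad"), each = 10))
difference <- rnorm(30)

n.levels <- length(levels(education))

if (n.levels == 2) {
  # Two levels: t test
  test <- t.test(difference ~ education)
  p <- test$p.value
} else if (n.levels > 2) {
  # Three or more levels: analysis of variance
  test <- aov(difference ~ education)
  print(summary(test))
  # The first row of the ANOVA table holds the P value for the factor
  p <- summary(test)[[1]][["Pr(>F)"]][1]
}
print(p)
```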
When we come back in the next video, we're actually going to use this function, and we're going to nest it within a looping structure that allows us to iterate through all of our categorical variables and have the computer do the work of all of the analysis for us. So this is going to be, hopefully, a very useful, very important video, and I hope to see you then. I will catch you in the next one.
16. Writing a loop for running the custom analysis function: Hello and welcome back. In the last video, we created our own user-defined function to perform very specific analysis tasks on our data set. What I want to do in this video is show you how to call, or how to use, that function. We're also going to place that function within a short for loop, a looping structure that will allow us to iterate that function through our five independent categorical variables, so our education, family, employment, marital status and sex variables. That allows us to automate the analysis instead of writing it all out five times. The first thing that we have to do is run the function that we created, so that it becomes part of the R environment and R recognizes it as a function. So I'm going to scroll back up through the script that we wrote, back to where we defined con cat fun on line 180. I'm highlighting it all the way down to the closing curly brace on line 222, and I'm going to go ahead and run that, and you'll see that it runs in the console. We ran all of those lines of code, the entire function definition, but it hasn't done anything, right? We don't get a plot. We don't get any statistical output, because that is just defining the function. We haven't sent anything to it yet. So we've defined the function, and you'll see in the environment window that we now have a heading called Functions, and we have con cat fun, and it's here. It says it's a function, and it tells us the three arguments that we defined that it requires for it to run. The next thing that we're going to do is create a looping structure where we can actually use this function. The first thing I'm going to do is create a comment for us here. Basically, I'm saying that this is a loop that sends categorical variables through the analysis, using the function that we defined above.
This loop is going to iterate, or continue to run, through the length of all the categorical variables that are in the categorical data frame that we created earlier, and we're going to use one variable at a time. If we look back at our environment up here, top right, remember we created life.data.cat. This is a data frame that only has our five categorical variables in it. But of course it has all 200 participants, so it's five columns and 200 rows. To start off this loop, we're going to write for, and we're going to say i. This is the index term, or the number of the iteration that the loop is on. I'm just calling it i. You can call it whatever you want. I tend to think that i is good because it stands for index or iteration. Then I'm going to say in, and then I'm going to write the number one, colon, and then length of life.data.cat. So I'm saying here that I'm going to calculate the length of life.data.cat. Let's see what happens when we just run length of life.data.cat. It tells me it's five, right, because there are five variables in the life.data.cat data frame. It doesn't tell me 200. That's the number of participants. It gives me five, the length being the number of variables, and I'm saying one through five. So in this looping structure, we're basically saying that i is going to equal one on the first iteration, then it will equal two, three, four, five, and then it will stop. That's all that this is saying. It's saying that i will equal one, and it will then equal two, three, four and five as it iterates through this loop. So what are we going to do inside the loop? Well, we use curly braces again. I'm going to create a few blank lines for us to write this loop. You'll notice that it indents for us automatically, and this is to help us visually see that, okay, this is a loop, and everything that is indented is what the loop is doing on every iteration.
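To make the point about length concrete, here is a small sketch with a made-up data frame (demo.cat is a hypothetical stand-in for life.data.cat). For a data frame, length counts the columns, while nrow counts the rows.

```r
# Hypothetical stand-in for the categorical data frame (3 rows, 2 columns)
demo.cat <- data.frame(
  sex = c("female", "male", "male"),
  job = c("yes", "no", "yes")
)

print(length(demo.cat))  # number of columns (variables)
print(nrow(demo.cat))    # number of rows (participants)

# The loop index takes the values 1, 2, ... up to the number of columns
iterations <- c()
for (i in 1:length(demo.cat)) {
  print(i)
  iterations <- c(iterations, i)
}
```

One design note: seq_along(demo.cat) produces the same sequence and is a little safer, because 1:length(x) counts 1, 0 when x has no columns.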
So if we're going to supply things to our function, we have to first of all define variables for the arguments that the function requires. I'm going to say each var. Remember, that's one of the arguments that the function requires. And I'm going to say life.data.cat, and I'm going to use square brackets to subset that data frame. Remember, within square brackets we can give two numbers. We give the row number, comma, column number. And if I want all of the rows, I leave the row number blank. Remember, before, we wanted all columns, so we were leaving that blank. Here I put nothing, comma, and then I specify which column I want. And here I'm going to say i, because i, in the first iteration through this loop, is going to equal one. So in effect, I am saying I want to subset from the life.data.cat data frame the first column, all rows. I'm sending the entire first categorical variable to each var. I'm creating a new variable called each var that has the entire categorical variable in it. Then I have to define name, another argument that the function requires. I'm going to say that name is going to equal vars.cat. Remember, vars.cat we defined previously, and it is a vector that has the names of all of our five categorical variables: sex, marriage, job, family and education. So I'm going to put square brackets after it to subset it, and I'm just going to put i, because this is not a data frame. vars.cat is just a vector of five things. Sex, marriage, job, family and education is only one-dimensional, so I only have to specify i, with no comma here. When i equals one on the first iteration through this loop, it will take the first element of vars.cat, which is sex. On the second iteration it will take marriage, then job, etcetera, as it loops through. And then the last thing that we want to do inside this for loop is say con cat fun.
So we're calling the function we created, and we give it the arguments that it requires: each var, name, and then the third argument that it says it requires, response. For response, I'm going to give it life.data.dv, the dependent variable, the response variable. You'll see in our environment that life.data.dv is a numeric vector that has 200 values. It has the difference measure for all of the 200 participants. It's not a data frame. It's just a vector of 200 values, and I'm giving all of those 200 values to the function. Now, you may be slightly confused, thinking: whoa, we said each var and name, so shouldn't this third argument be response? Don't we have to call it that? No, that's just the name of the argument that we defined in the function. We can give whatever we want as the response argument. It just so happens that I called each var and name here the same as the arguments in the function. So with each var here, I'm defining a new variable outside of the function with the first column of the categorical data frame. And I'm saying the name of that variable is coming from vars.cat, the first element of that. I could have called these anything I wanted to and put them into the function in place of each var and name. So that's the entire for loop there. As this iterates, it is going to run the function, giving it one variable at a time. Then it will do the second, the third, the fourth and the fifth. To give you an example of this, we can go ahead and run this for just one iteration. To do that, I'm going to come down to the console and I'm going to assign i the number one. That allows me to run what's inside this for loop without letting it iterate through everything. I'm going to say i is equal to one, and then I'm just going to run the code that is inside the for loop. I'm not going to run the for loop itself.
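The loop described above can be sketched as follows. Everything here is a stand-in: demo.cat, vars.demo and demo.dv are made-up data, and describe.var is a hypothetical function with the same three arguments (each.var, name, response) that simply returns group means instead of running the full analysis.

```r
# Hypothetical stand-ins for the course objects
demo.cat  <- data.frame(sex      = c("f", "m", "f", "m"),
                        marriage = c("yes", "yes", "no", "no"))
vars.demo <- colnames(demo.cat)   # character vector of variable names
demo.dv   <- c(1, 2, 3, 4)        # numeric response, one value per row

# Hypothetical function taking the same three arguments as in the video
describe.var <- function(each.var, name, response) {
  cat("##########", name, "##########\n")
  tapply(response, each.var, mean)   # mean response within each level
}

results <- list()
for (i in 1:length(demo.cat)) {
  each.var <- demo.cat[, i]   # all rows, column i
  name     <- vars.demo[i]    # element i of a one-dimensional vector
  results[[name]] <- describe.var(each.var, name, demo.dv)
}
print(results)
```

The variables passed in do not have to share the argument names; the third call argument here is demo.dv, which the function receives as response.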
I'm just running this code once, and whatever i is will now be equal to the number one. When I run that, I get the output that we saw before. I get the statistical summaries of the difference measure for female and male, so what the differences are between male and female. I get the output from the two-sample t test with the P value, and I get the box plot over here with a P value added to the top of it. So, having done that, you've seen those results before. I showed you those when we were defining the function in the first place. Let's see what happens when I assign i the number two. Now let's go and highlight just what's in this for loop again, just the inside, the meat of the for loop, and run that now that i equals two. Now you'll see it's gone to the second categorical variable, and we can see here in the console that we've got the summary for not married and the summary for married. You can see that we've got a mean for those that are not married that is positive, 3.432, so it appears that not married people think that they are older than they actually are. And then for those that are married, we have a strongly negative mean value, negative 12.556. It appears that those that are married think they're younger than they actually are. If we want to know whether that's a statistically significant difference or not, we can look at the P value we see here. The P value is a very small number. This is 9.238 e to the negative 10. That's scientific notation, so this is 9.238 times 10 to the negative 10. That's a very small number. It's very significant, much lower than the 0.05 threshold that we use to reject the null hypothesis, meaning that we can say that there is a statistical difference between those that are married and those that are not married. When we come over and look at the box plot that it created, we see that not married has much higher values than married.
So it appears that when you're married, you think you're younger than you are, and the P value from the t test confirms that. This P value is very small, so the function has correctly reported it as less than 0.001, as opposed to writing out 9.238 times 10 to the negative 10. This is interesting. It appears that if you're married, you feel younger and maybe you feel happier. So that's a result for the books. You can go ahead and run this for loop now. When you do run the entire thing, it will go ahead and cycle through all of the variables in the plotting window. For example, the last variable that it produced was the education variable, and we see a slightly significant difference here between high school, undergrad and grad. It appears that those with more education feel younger than they actually are, which is sort of interesting. We can go back to older plots by clicking this blue left arrow here, so you can also see family support. There doesn't appear to be a difference whether you have family or not. Then job, whether you're employed or not: it appears that people that are employed, that have a job, tend to think they're younger than they actually are, and that's a statistically significant difference there. It's not less than 0.001, so it's reported rounded to four decimal places. And then we're back to marriage, and of course we also have the sex variable there that we've seen before. So you can cycle through plots here like that. And then in the console, you will get the printed statistical output if you scroll back upwards. First of all we get sex. Then we have marriage. Then we have employment, then family, and then last of all we have the education level and the results of the ANOVA, because there were three levels within that categorical variable. Let's say that you want to automatically save these things off, which we've done before.
Let's say that, as we're running this for loop, we want to have the plots populate a PDF document, and we want all of this statistical output printed to the console to go to a text file, so that we can have those readily available and open them up when we're no longer in the R environment, or on another machine, or send them to someone, or whatever. What we can do to allow for that to happen is set up a pdf function and a sink function that will send these analyses and these plots out to our machine. So I have added a few pre-prepared lines of code here on lines 225 through 228. What we're doing here, again, is setting up storage to allow the figures to go to a PDF document and the statistical output to go to a text file. I'm using the pdf function, and I'm specifying where this is going on my machine. I'm using the path here to a PDF file that I'm calling cat var stats, in the output folder of my Section one folder for the Udemy class. And then I'm using the sink function and specifying a path to a text file that I'm calling cat vars stats. Again, you will have to change these paths and file names to whatever suits your machine, wherever you want your output to go. And remember, if you're on a Windows machine, the slashes have to be forward slashes and not backslashes. Then, after the for loop, we will want to stop sinking out to the text file, and we will also want to stop sending figures out to the PDF. To do that, we close the sink by saying sink, open and closed parentheses, and then dev.off to close the graphics device. Let's see what happens when we run these lines of code all together. So we've done that, and we didn't see anything happening with the plots. We didn't see anything printing out to the console other than the code that we're running. But now, if we go to the folder on our machine and open the Finder window here, I'm going to scroll down to my Udemy folder. I go to my section one and on to my output.
I now have two new files here. One is called cat vars stats, and this is where I get a PDF, and on each page of the PDF document I have those figures that we've created. So I can save those in the PDF and look at them any time. And then I also have the cat vars stats text file, with all of that statistical output readily available there in a text file. In the next video, we're going to go ahead and do a very similar few steps of analysis and define a new function for analyzing our continuous variables, as opposed to our categorical variables. So we're going to continue our analysis and complete our examination of this life data set. I hope that you're having fun so far. We're already drawing some really meaningful insights from these data, and I will catch you in the next video.
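The output-capture pattern described above can be sketched like this. The file names and the use of tempdir() are assumptions for illustration; substitute your own folder path. The loop body is a trivial stand-in for the analysis function.

```r
# Assumption: tempdir() stands in for your own output folder
out.dir  <- tempdir()
pdf.path <- file.path(out.dir, "demo_plots.pdf")
txt.path <- file.path(out.dir, "demo_stats.txt")

pdf(pdf.path)    # from here on, plots are written to the PDF, one per page
sink(txt.path)   # from here on, printed output is written to the text file

for (i in 1:3) {
  print(paste("Iteration", i))                      # goes to the text file
  plot(1:10, (1:10)^i, main = paste("Power", i))    # goes to the PDF
}

sink()       # stop diverting printed output
dev.off()    # close the PDF graphics device
```

Both sink() and dev.off() need to be called after the loop; until the device is closed, the PDF is not complete and may not open.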
17. Writing a loop for creating scatter plots: Hello and welcome back. The next thing that we're going to do in the script is start to analyze the continuous variables that we have in our data set. These include age, body temperature, things like heart rate, the level of the participant's support network, as well as how they rated their own health. I've created a comment box here that's going to separate this new section of code for us visually, and I've said that this is continuous variables. We're going to run descriptive statistics once again, we're going to make some scatter plots, look at some correlation analysis, as well as do some regression analysis and run some diagnostics for that. Within this section of code, we will be defining another function, we'll be using if and else, and we'll also be cleaning output up and making plots look a little bit better. The first thing that we actually want to do is create a data frame for the continuous independent variables, just as we did before for the categorical variables. So what I'm going to say is life.data.con. Remember, before we had .cat, but I'm saying life.data.con for the continuous variables. I'm going to assign to this new data frame the data.frame function. So I'm making a data frame, and I'm going to give it the variables age and body temp. Remember that we can refer to these just using their names, because we have attached them to the R environment, so we no longer have to reference life data, dollar sign, for each one of these. We can just use the names as they are. We also want to add heart rate, support level and health. These are the remaining continuous variables in our data set. So let's go ahead and run that. You'll see that we get a new data frame called life.data.con. It's 200 observations of five variables, and I can use the little blue arrow button to look at what they are: age, body temp, heart rate, support and health. And it gives me, again, samples of these.
They're all either integers or numbers, and they are all continuous. The next thing that I would like us to do is create a vector of names for our continuous variables, so I'm just going to make a comment to myself: I'm creating a vector of these variable names. I'm going to create a new object called vars.con, just like we made vars.cat previously for the categorical variables. I'm going to use the colnames function again to get column names, and I'm going to extract the column names from life.data.con, the data frame that we just made. So I run that, and I now have a new object here, a value in my environment called vars.con, and it is a character vector with the five names of those continuous variables, just like we previously made vars.cat for the categorical variables. The next thing we're going to do is make scatter plots of all five of these continuous independent variables. To do that, we're going to use a short looping structure once again. This will give you a chance to explore loops a little bit further and practice writing them. Again, we start with for. We're going to say for i in one through the length of life.data.con. Again, life.data.con is the new data frame we've created. If we run length of life.data.con, it will tell us it has a length of five, because there are five variables within that data frame. So we're saying in this for loop that i is going to be equal to one through five. On the first iteration i will equal one, on the second iteration it will equal two, then three, and so on. To start the loop, we have to use a curly brace. I'm going to create some room inside the loop again, and you'll notice we're indented for the functionality that's going to happen inside this for loop.
So it's indented, and we're going to write each var, a new variable called each var. Then we're going to say life.data.con, and we're going to subset that using the square brackets once again, and say comma i. Again, we can subset a data frame using square brackets. The first number is the row number, and then comma, column number. If I leave the row number blank, it's going to return all rows, and i is going to subset one column at a time. On the first iteration through this loop, i will equal one, so I will extract just the first column of life.data.con, and I'm assigning that to the new variable called each var. I'm then going to write name, creating a new variable called name, and I am going to say that that is vars.con, and I'm going to subset that using i. Again, this is just like what we did before for the categorical variables. vars.con is a character vector, so there are five names in that vector. In each iteration through here, I will be pulling out one of those at a time, depending on what i is equal to, whether it's the first, second, third, fourth or fifth element in vars.con. And then, instead of calling a user-defined function like we did previously, here I'm going to use the built-in plot function. I'm going to say each var, comma, life.data.dv. That is the object we made previously that is just the 200 observations of the dependent variable, the difference measure. It is a numeric vector here in our environment. We've got a numeric vector with 200 observations of those difference scores. So I'm sending it the continuous variable, and I'm sending it the difference measure. Then I'm going to say that xlab, the x-axis label, is going to equal name, which I've defined here as the first variable name in my vars.con. And then I'm going to say that ylab is equal to Difference measure, because the difference measure is going to be constant on the y axis, no matter what variable we're sending, for each continuous variable.
We're always assessing each variable against the difference measure, so I can label the y axis Difference measure. Let me get rid of this extra line here, and let's see what this does when we run it. Once again, we can come down to the console. We can say i is going to be assigned one, and we hit return to run it. Then we can run just the lines of code that are within this for loop. And there we go. It is running just the first variable, which, if we look at vars.con, is age, and we can see here that we've got age. This is in months again. We could have put, in parentheses, months, but we're just using the name, so we're just seeing age. We have a scatter plot with empty circles indicating individual data points, showing the relationship between the age and the recorded difference measure of the participant. So we can go ahead and run this entire for loop, and we'll see that it has actually made all of the figures. The last variable was health, how the participants rated their health. We can use the blue arrow to go back through previous plots, so bear with me. We can also see support here, how the participant rates or measures their support network, how supported they felt. We also have the heart rate that we measured for each participant. And we have body temperature for each participant, and then we're back to the initial age plot. So this is a quick way of making multiple plots, using many different variables and letting R do the work for you, iterating through all five of those variables and comparing them to the difference measure. Instead of writing out plot five times, we have created the for loop, and we can get R to do that plotting and all of that work for us. So do stick with it. In the next video, we will be creating our second user-defined function to allow us to do analysis, and particularly regression analysis, on all these continuous variables, to answer more involved questions about them. I will catch you in the next one.
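A compact sketch of this section's steps, using simulated data: build a data frame of continuous variables, pull out its column names, and loop over the columns to draw one scatter plot each. The object names (demo.con, vars.demo, demo.dv) and the values are assumptions, not the course data.

```r
# Simulated stand-ins for the continuous predictors and the response
set.seed(3)
demo.con <- data.frame(age        = runif(20, 200, 900),
                       heart.rate = rnorm(20, mean = 70, sd = 8))
demo.dv  <- rnorm(20)

vars.demo <- colnames(demo.con)   # character vector of column names

for (i in 1:length(demo.con)) {
  each.var <- demo.con[, i]       # all rows, column i
  name     <- vars.demo[i]
  # x label changes with the variable; y label is always the response
  plot(each.var, demo.dv, xlab = name, ylab = "Difference measure")
}
```

Run interactively, each plot appears in the plotting window; run as a script, R writes them to its default graphics device instead.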
18. Creating a custom function for statistical analysis (correlation & regression): Hello and welcome back. In this video, we're going to define another function in our R code. This function is going to allow us to do analyses on all the continuous independent variables and look for relationships between those independent variables and the dependent variable, the difference measure. As I did before, I have already pasted in this function, and we're going to go through and discuss it one line at a time. But before we do that, I want to show you what we're going for, what the function is actually going to produce. So let me maximize the console, and let's look at it and at the plot that we get out of this function. The first thing that this function does is assess the Pearson's product moment correlation, so we get a Pearson's r value out of this function. Pearson's correlation, or Pearson's r, is a value between negative one and positive one. If the value is positive one, it means there is a perfect positive linear correlation between the independent variable and the dependent variable. If the Pearson's r value is zero, it means there is no linear correlation whatsoever. And if the Pearson's r value is negative one, it means there is a perfect negative linear correlation between the independent and dependent variable. When we run the Pearson's product moment correlation, we get, first of all, a P value. This tells us whether the correlation is statistically significantly different from zero or not, whether there is a statistically significant positive or negative correlation between the variables. We see here that the P value is less than 2.2 times 10 to the negative 16. That's a really very significant, very small P value, much less than 0.05, so we can assume that the correlation here is strongly different from zero.
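As an illustration of the correlation output described here, the sketch below runs cor.test on simulated data with a built-in negative trend. The variable names and the numbers are assumptions chosen only to mimic the shape of the result; they are not the course data.

```r
# Simulated data with a negative linear trend (assumption, not course data)
set.seed(4)
age        <- runif(50, 200, 900)
difference <- -0.05 * age + rnorm(50, sd = 8)

ct <- cor.test(difference, age)   # Pearson's product-moment correlation by default
print(ct)

r     <- ct$estimate   # Pearson's r, always between -1 and 1
p.cor <- ct$p.value    # P value for the test that the correlation is zero
```

The printed output includes the P value first and the sample estimate of r at the bottom, which is the layout walked through in the video.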
But what actually is it? Well, we come further down here to the actual correlation value that has been found, and it's negative 0.558. That means there is a fairly strong negative correlation between our independent variable and our dependent variable here. And what are we looking at right now? We're looking at age. The function has put this header up here for age, surrounded by the hashtags, or pound signs. So between age and the dependent variable, we expect a strong negative linear correlation. That brings us to another thing this function is going to do for us. It's going to create a scatter plot, as you can see here in the plot window, bottom right. It's creating a scatter plot with age on the x axis, and remember that's in months, and the difference measure, our dependent variable, on the y axis. If we just look at the black dots, the data points here, we can see that there is this downward slope: the higher the age becomes, the more negative the difference measure tends to become. So it tends to be that the older someone is, the younger they actually think they are. I guess you could say that the closer to death someone is, the farther away they like to think they are from it, which may make some sense. The other thing the function does, other than creating the scatter plot, is draw a red line through the data. This is the fit of a linear model, or a regression line. It is a best-fit line through all of the data points, so we can visually see that this red line is decreasing from left to right, indicating that as age increases along the x axis, the difference measure goes down and becomes more strongly negative. Another thing that the function does is add the adjusted R squared of the linear model, or regression, above this plot. The R squared in linear regression is the proportion of the variance that the linear model is explaining. So this linear model is explaining about 31% of the variance within these data.
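The regression line and adjusted R squared described above can be sketched as follows, again on simulated data with assumed names (age, difference, fit, fit.sum). Reading the slope's P value from the coefficient table is one standard way to get it; the course function may do this differently.

```r
# Simulated data with a negative linear trend (assumption, not course data)
set.seed(5)
age        <- runif(50, 200, 900)
difference <- -0.05 * age + rnorm(50, sd = 8)

fit     <- lm(difference ~ age)   # response ~ predictor
fit.sum <- summary(fit)

plot(age, difference, pch = 19, col = "black",
     xlab = "age", ylab = "Difference measure")
abline(fit, col = "red", lwd = 2)   # add the best-fit line to the plot

adj.r2  <- fit.sum$adj.r.squared                        # adjusted R squared
slope.p <- fit.sum$coefficients["age", "Pr(>|t|)"]      # P value of the slope
mtext(paste("Adj. R2:", round(adj.r2, 2)))              # label above the plot
```

An adjusted R squared of about 0.31, as in the video, would mean the line accounts for roughly 31% of the variance in the response.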
So it's doing a reasonable job of explaining why we have this decrease. The P value here is the P value of the linear regression. If the P value here is less than 0.05, it means that the slope of this linear regression is significantly different from zero. Here we've got a P value that is less than 0.001. In this function we're also outputting the statistical output, or analyses, from fitting that linear regression, and I'm not showing that here. So now that we've seen what the function is doing, let's look at the individual lines of the function. Again, I encourage you to first of all listen to my explanation of what this function is doing before actually starting to copy down the lines of code. You can always pause the video and do that, but I think it's best if you don't try to copy the code while I'm actually explaining it in the first instance. I've written a comment to myself here saying this function is for analysis of continuous independent variables against continuous dependent variables. We're calling it con con fun this time. Because we're analyzing continuous variables against the continuous dependent variable, I've said it's continuous, continuous, and I'm calling it fun for function. And again, we are assigning the function to that and giving it the three arguments, each var, name and response, just as we did before with the con cat fun that we wrote in the previous video. First of all, we're going to create the header that tidies up our statistical output. We're going to create a new variable called header, and we're going to assign to it, using the paste function, a row of hashtags, comma, name, which we're getting from the second argument of the function, comma, a second row of hashtags, to make a nice divider in the statistical output. And we are going to print that header that we just made using the paste function. Then we are going to print the results of a Pearson's correlation.
So to run the correlation test, we use cor.test as the function, and within the parentheses we're going to supply it with the response variable, so our difference measure for each participant, and then we're going to supply it the continuous independent variable that we want to assess the correlation with our difference measure. So, for example, with age, I am giving the difference measure here, and then I'm giving the age of participants to this cor.test function. And then I am printing the output of that function, all of the statistical analyses, to the console. The next thing we're doing is creating the plot. So we're saying plot function, parentheses, eachvar. So we're giving it, for example, the age variable. We're then saying that the y variable in this plot is going to be the response, that difference measure. We're then labeling the x axis with the name argument that we gave to this function, so that in this case would be age. And then we're saying ylab, or y label, equal to difference measure. That's always going to be the difference measure, because the y label is always the dependent variable, no matter what variable we're reading into this function. And then we're saying type equals p, because we want a point plot, a scatter plot, and then using an additional argument called pch, and I'm saying it's equal to 19. That determines what kind of points I get. So pch equals 19 gives me filled circles. I will let you read the help on plot and look up what other pch values you can use. You can completely run with that if you want to. And then I'm saying col equals black, so the color is equal to black. That's why we're getting filled black points on this scatter plot.
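The header, correlation test, and scatter plot steps described so far might look something like the sketch below. The data are made up purely so the lines can run on their own, and the exact spellings eachvar, name, response, and header are assumptions based on how the names are spoken in the video.

```r
# Made-up data standing in for the course data set (an assumption)
set.seed(1)
age_months <- runif(200, min = 216, max = 960)
difference <- -0.05 * age_months + rnorm(200, sd = 12)

# The three values the function receives as arguments
eachvar  <- age_months   # continuous independent variable
name     <- "Age"        # text used in the header and on the x axis
response <- difference   # dependent variable (the difference measure)

# Header that separates each variable's output in the console
header <- paste("##########", name, "##########")
print(header)

# Pearson's correlation between the response and the predictor
cor_result <- cor.test(response, eachvar)
print(cor_result)

# Scatter plot: points only, drawn as filled black circles
plot(eachvar, response,
     xlab = name, ylab = "Difference measure",
     type = "p", pch = 19, col = "black")
```

cor.test uses Pearson's method by default, which is why no method argument is needed here.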
The next thing that we do in this function is fit a linear model using the lm function, and we're giving it the response, the difference measure, and we're saying tilde, the variable that we're giving it as the independent continuous variable. So, for example, if we're using age, it would be the difference measure, tilde, the age variable that it would be giving to the linear model function. And then we're assigning the output of that linear model function to the variable fit. We can then add the fit of that linear model to the plot that we've made using the abline function, which adds a line to any plot, and we give it the fit. So that is the output from the linear model that we defined two lines previously. And then we're saying comma, col equals red, and comma, lwd equals 2. lwd is the argument for line width. By default it is one; I've made it a little bit heavier, a little bit thicker red line, by saying lwd equals 2. We're then creating a summary of the linear model, or regression fit, by supplying the fit to the summary function, and we're going to assign the output of that summary to a new variable called fitsum. And then we can print that fitsum to the console so that we can see a statistical summary of our linear model. Now, just like we did previously, I'm using a branching statement here to tidy up our P values. So again, if the P value is less than 0.001, we want to just report that as less than 0.001. We don't want to have loads of digits way past the decimal point. And if it is not less than 0.001, I'm rounding off to four significant digits. I'm doing this using fitsum, the summary of the statistical fit, which is the variable that we created a couple of lines above, and then using the dollar sign to access its attributes. And then I'm accessing the coef, for coefficients, attribute.
Within that, I am accessing the second row and the fourth column, using square brackets to subset that coefficient attribute in fitsum. That is where I will find the P value within the output of that linear model fit function. So then, in this if statement, I can assess whether that P value I'm pulling out of that coefficient table is less than 0.001. If that's true, I'm creating a variable called pvalue that is going to be "less than 0.001" as a text string, in quotes there. And if it is not true, I'm saying else pvalue is going to be assigned the four-significant-digit rounding of that P value that I'm pulling out of that statistical summary of the linear model fit, the fitsum that we created. The next thing I want to do is add this as a label to the plot, in order to get the R squared and the P value up here on this plot so that people can see it easily. I'm going to create a variable called labtext, for label text, and I'm going to assign to that using the paste function. I'm going to paste together the text string Adj., for adjusted, A-d-j dot, and I'm putting capital R, caret sign, 2, for adjusted R squared, colon. So that's one text string. I'm going to paste together with that the rounded fitsum adj.r.squared. So from fitsum, that summary of the statistical fit, I'm using the dollar sign to call an attribute, that is, the adjusted R squared, and it's actually called adj.r.squared within that statistical output. So that's how I access the R squared value. And I'm rounding that to four significant digits just to tidy it up a little bit. And then I'm using another comma and pasting P value, colon, and then another comma to paste that together with the pvalue that has been determined up here in this previous if branching statement. And I'm then going to add that to the plot. So I'm using the mtext function.
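Here is a rough sketch of the model-fitting and labeling steps just described, again on made-up data so that it runs by itself. The names fit, fitsum, pvalue, and labtext follow how they are spoken in the video, so their exact spellings are assumptions, and signif is assumed to be the function used for rounding to four significant digits.

```r
# Made-up data standing in for the course data set (an assumption)
set.seed(1)
eachvar  <- runif(200, min = 216, max = 960)
response <- -0.05 * eachvar + rnorm(200, sd = 12)

# Scatter plot that the regression line and label are added to
plot(eachvar, response, xlab = "Age", ylab = "Difference measure",
     type = "p", pch = 19, col = "black")

# Fit the linear model (response ~ predictor) and draw it as a red line
fit <- lm(response ~ eachvar)
abline(fit, col = "red", lwd = 2)

# Summarise the fit and print the statistical output to the console
fitsum <- summary(fit)
print(fitsum)

# Row 2 of the coefficient table is the slope; column 4 is its P value
slope_p <- fitsum$coefficients[2, 4]
if (slope_p < 0.001) {
  pvalue <- "< 0.001"
} else {
  pvalue <- signif(slope_p, 4)
}

# Build the label and write it in the top margin of the plot
labtext <- paste("Adj. R^2:", signif(fitsum$adj.r.squared, 4),
                 "P value:", pvalue)
mtext(labtext, side = 3)
```

The full name coefficients is written out here; the shorter coef after the dollar sign also works because R partially matches list names.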
I'm supplying it the labtext that I defined previously, and I'm saying comma and giving it the argument side equals 3. That tells R that I want it to be on the top of the plot, as opposed to underneath or at either side. I'll let you look up the help on that. I then want to set the graphics device, using the layout command, to a two-by-two grid that will be populated by four figures. Why am I doing that? Because this function that we're defining here is also going to do diagnostics on the regression. So if I go forward a plot here, I'm going to show you what it looks like. We get the standard diagnostic plots for our regression. So we get a residuals plot (residuals are the errors in the regression), and we get a Q-Q plot, which shows whether the regression residuals are normally distributed. That's one of the assumptions of regression, that they have to be normally distributed. So we can look here visually to see if our regression is actually okey-dokey, or if it is violating any of the assumptions of regression. But to get all of these plots together in one single plotting window, we have to change the graphics device. What I'm doing here is, in the layout function, I am supplying a matrix where I specify that I'm going to have four plots, 1, 2, 3, 4, and I'm putting them in a two-by-two matrix, a two-row, two-column matrix. In order to get these four to be displayed next to each other, as opposed to going on individual pages, I am using this matrix command with the layout function. I'll let you look up the help on that, but layout is a very useful way to split up plots onto a single page. We're then saying plot the fit of the statistical model, and that will plot by default these diagnostic plots. And because we've already set that layout to be a two-by-two matrix, it will put them four to a page. And then the last thing we're doing in this function is setting the graphics device back to its default.
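The layout and diagnostic steps can be sketched as follows. The data and the variable names x, y, and grid are made up for illustration; only layout, matrix, and plot on a fitted lm object come from the explanation above.

```r
# Made-up data and model, just so there is a fit to diagnose
set.seed(1)
x <- runif(200, min = 216, max = 960)
y <- -0.05 * x + rnorm(200, sd = 12)
fit <- lm(y ~ x)

# A two-row, two-column matrix numbering the four plot positions
grid <- matrix(1:4, nrow = 2, ncol = 2)
layout(grid)

# Plotting an lm fit draws its four default diagnostic plots,
# which fill the four positions of the grid
plot(fit)

# Return to the default of one plot per page
layout(matrix(1, nrow = 1, ncol = 1))
```

Resetting the layout at the end matters inside a function that is called repeatedly, since otherwise the next scatter plot would be squeezed into one quarter of the page.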
One plot: one column, one row, one plot per page. So that's the function there. I encourage you to listen to the explanation again, look up the help on each one of these new functions that we're using, and read the comments that I've left throughout this function. Again, this is very similar to the previous function that we defined for categorical variables, except now we're making scatter plots as opposed to box plots, and we're fitting a linear model and looking at its diagnostics. So do please join me in the next video, where we are going to actually run this function. We're going to place it inside a for loop, and we're going to see graphs of our continuous independent variables and look at whether they have any statistical relationship with the dependent variable. So I will see you in the next one.
19. Challenge #2 - Write a for loop to run the custom analysis function: Hello, and welcome back. In this video, I would like to set you your second challenge. I'd like to test your looping skills a little bit more. What I'm going to ask you to do is create a for loop that iterates through the continuous variables in the data frame that we created. I want you to send one variable at a time to the new con.con.fun that we just defined in the previous video. And to give you a hint in this task, we did this previously for categorical variables, so that for loop will look very similar. Go ahead and pause your video now, before I begin to explain how to do this. So welcome back, everybody. I hope that you managed to create a for loop and it's working OK for you. If you didn't, don't worry about it; for loops are very tricky. So I'm going to go ahead and write the for loop with you now. The first thing I'm going to do is create a comment to myself. I'm saying here that this is a loop for sending continuous variables to the continuous analysis function that we defined in the previous video. The first thing that we want to say is for. This is a for loop, and we're going to say i in 1 through length of lifedata.con. So this is the continuous variable data frame that we previously created, lifedata.con. You can see it's up here in our environment: it's 200 observations of the five continuous variables age, body temp, heart rate, support, and health, all of those continuous numeric variables. The next thing I need, of course, is a curly brace; I start the for loop with the curly brace. Remember that in the first part of this we're saying i in 1 through length of lifedata.con. Remember, the length of lifedata.con is how many variables are in there. It's going to be five, so we're saying 1 through 5, and then i is going to equal 1 through 5.
So in the first iteration of this loop, the first time it runs, i will equal 1; the second time it will equal 2, then 3, 4, and then 5, and then it will terminate. The first line of this loop is indented, of course, because that visually shows that we're within a looping structure. We're going to define a variable called eachvar. You can call this whatever you want. I'm just going to be consistent here and call it eachvar. And I'm saying that that is going to equal lifedata.con, so that continuous variable data frame, and I'm subsetting that using the square brackets. Again, square brackets allow us to subset a data frame. The first value in the square brackets tells us the row, the comma separates, and the column number is the second number. So here, when I say lifedata.con, square bracket, I am putting nothing for the rows, so it will return all of the rows. And then after the comma I'm putting i, so that will extract whichever column matches the iteration of the loop we're on. So the first time through the loop, i will equal 1, and it will extract the first column, which in lifedata.con is age; the second time through, body temp; then heart rate; and so on. We're then going to define a variable called name. We're going to say that's equal to vars.con, and we're going to put i in square brackets. vars.con we created previously; we can look in our environment here at vars.con. It's a character vector with five character strings: age, body temp, heart rate, support, and health. So as we iterate through this loop, i will equal 1 first of all, so it will collect the name age; then it will equal 2, so it will collect body temp; and so on. We're then going to send these new variables that we created to our con.con.fun. The first one is eachvar; we're sending that as the first argument to con.con.fun. We're then going to say name.
And we're going to say lifedata.dv. That's the dependent variable, or the difference measure. Again, here it is in our environment: lifedata.dv has the 200 participant difference scores. Now, this will call the function and will send these variables to it. But remember, before we run this for loop, we have to have already run the function once. I have already done that here, but you, having written all of those lines of that function that we discussed in the previous video, will have to actually run it. So go ahead and run that so that it is in R's environment. Now you see in the environment we have a subheading of functions, and we have con.con.fun there as well as the con.cat.fun that we defined previously. So now we can go ahead and run this for loop, and it has gone through and output the statistical summaries to the console. We can scroll up a little bit and see that we've got the statistical output for age: a P value, a correlation value, and the output from the linear model. Then we've got the heading for body temperature, and we've got the P value and correlation analysis here, and so on. We also have the plots. So we go back one here, and we see that this is the health plot. The P value is not significant; the slope of that linear regression line is not different from zero. It's flat. So we can't say that health has any relationship with the difference measure. We go back through more diagnostics to the support variable: the P value is not less than 0.05, so we can't say that the slope of that regression line is significantly different from zero. So there's no statistically significant relationship between support and the difference measure in the data set. We can go back through more diagnostics to heart rate: the P value is not less than 0.05, so no statistical relationship between heart rate and the difference measure. Go back through more diagnostic plots to body temp: again, the P value is not significant, not less than 0.05, so no statistically significant relationship between body temperature and the difference measure. Back through more.
And this is what we already saw, the age variable. There is a statistical relationship between age and the difference measure. So we can say that older people tend to think that they are younger than they actually are, and younger people tend to think that they are older than they actually are. On that note, that concludes the first section of analyses and code writing. I hope that you found it useful, and I hope that you found the analysis that we've been doing somewhat interesting. The next video will conclude this section with a summary of what we have learned, both in terms of writing R code as well as what we've learned in terms of statistical analyses. And then we'll move on to the second section of the class, which will focus on time series analysis. I will see you then.
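The for loop built in this video can be sketched as below. Everything needed to run it is made up here: the data frame, the vector of names, and the dependent variable are random stand-ins, the object spellings lifedata.con, vars.con, and lifedata.dv are assumptions based on how they are spoken, and con.con.fun is replaced by a lightweight stand-in that only prints a header and the correlation estimate rather than the full analysis.

```r
# Made-up stand-ins for the course objects (assumptions)
set.seed(1)
lifedata.con <- data.frame(
  age        = runif(200, min = 216, max = 960),
  body_temp  = rnorm(200, mean = 37, sd = 0.4),
  heart_rate = rnorm(200, mean = 70, sd = 8),
  support    = runif(200, min = 1, max = 10),
  health     = runif(200, min = 1, max = 10)
)
vars.con    <- c("Age", "Body temp", "Heart rate", "Support", "Health")
lifedata.dv <- -0.05 * lifedata.con$age + rnorm(200, sd = 12)

# Lightweight stand-in for the analysis function from the previous video
con.con.fun <- function(eachvar, name, response) {
  print(paste("##########", name, "##########"))
  print(cor.test(response, eachvar)$estimate)
}

# Loop for sending continuous variables to the continuous analysis function
for (i in 1:length(lifedata.con)) {
  eachvar <- lifedata.con[, i]   # all rows, column i
  name    <- vars.con[i]         # matching label for that column
  con.con.fun(eachvar, name, lifedata.dv)
}
```

The length of a data frame is its number of columns, which is why 1:length(lifedata.con) steps through the five variables. Writing seq_along(lifedata.con) instead would do the same thing and also behaves sensibly if the data frame happens to have no columns.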
20. Course Review: Hello. Congratulations to you on making it through the course. I hope that you've learned many, many new things and that you found it interesting along the way. I hope that applying all of these new things to a real-life data set has really helped with your understanding of how to use R and the importance of what we're doing to complete real analysis tasks. Having said that, let's review and talk a little bit about all of the things that we have learned in R. So here's what we did in the first section of the class. We loaded data and we changed the column names. So we read in a data set from a file on your computer and we changed the column names. We cleaned that data, so we checked for errors, and we also fixed those errors. Many of the errors were associated with missing values, but we also had unrealistic values for things like body temperature and for age in months. So we learned to account for those, find them, and fix them. We also learned how to create many different basic plots. In this first section of the class, we created bar plots for the proportion of one factor versus another factor in our data set. We created box plots to show the differences between different variables in our data set. We also encountered histograms, and we also made scatter plots, so many different basic plots that will get you through a lot of different descriptive analyses. We also created multi-plots, or multi-part plots, where we graphed numerous plots on the same page. This could be really helpful if you are submitting something for a presentation or a report or a publication. We also looked at outputting figures and statistical analyses to PDF files and to text files. So as we went through, in particular, those functions and those looping structures to automate our code, it was very useful to have the plots and statistical output that we're producing go directly to PDF documents and to text files.
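As a reminder of what sending output to files can look like, here is a minimal sketch. It assumes base R's pdf and sink functions, which may or may not be the exact ones used earlier in the course, and it writes to temporary files with made-up data so that nothing on your computer is overwritten.

```r
# Temporary output files (hypothetical locations for this sketch)
out_pdf <- tempfile(fileext = ".pdf")
out_txt <- tempfile(fileext = ".txt")

# Made-up data
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# Send a plot to a PDF file instead of the plot window
pdf(out_pdf)
plot(x, y, pch = 19)
invisible(dev.off())   # closing the device writes the file

# Send console output to a text file instead of the console
sink(out_txt)
print(summary(lm(y ~ x)))
sink()                 # restore normal console output
```

Forgetting the closing dev.off or sink call is a common slip: the PDF stays empty or later output silently keeps going to the text file.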
We worked with many different types of data variables in R, including vectors, which are groups of elements of a single type. We also worked with factors, which are for categorical variables that have different levels within them. So, for example, in the sex variable we had males and females, so we worked with that kind of data, and we also created data frames and worked with those. We learned how to create column names. We learned to subset data frames, so that was very useful. We also learned to create our own functions. Throughout R we're using functions all of the time, and most of those functions are coming from the R program itself or from packages that we will install throughout this class. But we're also learning here in this section how to write our own functions to perform very specific analysis tasks on our data set. We also learned how to use looping structures to automate our code, to create multiple analyses and multiple plots for different variables at once, without having to write out many, many more lines of repetitive code. And we also learned how to use if statements to allow our code to branch into differing flows, to account for differences in the types of analyses that we have to perform. We also performed many statistical tests. We created descriptive statistics. We learned how to summarise our data using means, medians, and other measures of general descriptive statistics, and we also learned to perform inferential statistical tests that tell us something a bit more in depth about our data. So we used t tests and ANOVAs. We used correlation analysis, and we also performed regressions and looked briefly at the diagnostic plots for those regressions. And finally, perhaps most importantly here, we analyzed a real-life data set. So we took real data from a study that I conducted years ago and we actually performed basic statistical analysis on it. We created plots of the data, we visualized the data, and we also answered questions about these data.
So let's look briefly at some of the things we found out. This was one of the box plots we created. This is showing the marriage independent variable on the x axis, and we're showing the difference measure on the y axis. So again, we've got not married and we've got married: individuals that were not married on the left and those that were married on the right, and then we have the difference measure. And remember, if the difference measure is negative, it means that the individual thinks that they're younger than they actually are; they perceive themselves as younger. And if the difference measure is positive, they perceive themselves as being older, or further along in their lifespan, than they actually are. What we found was that there is a statistically significant difference between married and not married, and we found that those who are married are actually more likely to perceive themselves as younger, or not as far on in their lifespan. We performed a t test to see if there was a real, statistically significant difference between these two groups, and we found it was significant. So we can say statistically that those who were married do really perceive themselves to be younger. We also found something similar with employment status. We've got unemployed on the left and employed on the right, and we found that those that are employed, that are working, tend to feel that they are younger than they actually are. And again, we performed a t test on this and found that it was statistically significant, with a P value less than 0.05. Then we came to education level, where we have high school education, undergraduate education, and graduate-level education, and we found here that the more educated you are, the more advanced you are in your schooling, the more you perceive that you are younger than you actually are. So we did an ANOVA test of this because we have more than two groups here.
We performed an analysis of variance across these groups and we found it was statistically significant, that there is in fact a difference between these groups, and that the more education you have, the younger you believe you are. And then finally we looked at some of the continuous variables. We looked at age of participant in months, and we found that the older the participant is, generally the younger they feel that they are. And again, we performed a regression analysis on this and found that this is statistically significant. This negative correlation here is statistically significantly different from no correlation. So we're finding here that younger people tend to think they're older than they actually are, and older people tend to think they're younger than they actually are. So we've performed a lot of fairly basic yet very informative analysis here, and I hope that you found this interesting and I hope that you've enjoyed learning how to do all of these things in R.