R for Data Science: A Practical Introduction | Donovan Harshbarger | Skillshare

R for Data Science: A Practical Introduction

Donovan Harshbarger, STEM Enthusiast & Generative Artist

9 Lessons (34m)
  • 1. Lesson 1: Introduction, or Why R is Awesome (1:18)
  • 2. Lesson 2: Installing R, RStudio, and Tidyverse (2:37)
  • 3. Lesson 3: Importing the Data (4:01)
  • 4. Lesson 4: Graphing with Quickplots (5:00)
  • 5. Lesson 5: Summary Statistics by Group (3:24)
  • 6. Lesson 6: Custom Subsets with Filter (4:48)
  • 7. Lesson 7: Turning Characters to Factors (4:19)
  • 8. Lesson 8: Reporting with R Markdown (6:46)
  • 9. Lesson 9: Recap and Project (1:32)

About This Class


R is one of the most popular platforms for data science, largely because you can be highly productive with only a little code. We'll start with a simulated data set from a company's sales force, and in almost no time we'll be mining the data for the knowledge that HR needs to hire the best employees. We'll learn how to get summary data for different subgroups of employees, as well as how to create simple graphs for quick insights. We'll also learn to create reports of our findings that we can easily update when the data changes.

You don't need any programming experience or prior statistics knowledge for this course. I'll even show you where to get R and RStudio, the two pieces of free software that we'll use.

Transcripts

1. Lesson 1: Introduction, or Why R is Awesome: Welcome to R for Data Science. I'm Donovan Harshbarger. After studying math and physics in college, I've done all sorts of things, including writing web software and computer-generated art. I also taught high school for 20 years, where I did a lot of statistics work to find out how our initiatives were working. I like to keep my teaching style very practical and get students doing interesting things as fast as possible, and I'm going to try to do the same for you in this course. One of the best ways to get up and running quickly is with R and RStudio. R is a programming language designed especially for statistics and data science, and it's made to let you jump in and do useful things without writing a lot of code. Don't worry if you've never programmed a computer before; this is a beginner's course, so I'm not assuming any prior experience. We'll also use RStudio, because it makes working in R even more convenient for newbies and experienced coders alike. This course will look at data from the telephone sales force of a fictional company, and we'll explore the data set to discover the traits that the successful employees have. We'll also learn to report our findings from within RStudio, which makes it easy to update them whenever we get new data. For the class project, I've provided a second data set that you can analyze and report on using the techniques you learn in this course. I hope many of you will post reports to the project gallery for others to learn from. 2. Lesson 2: Installing R, RStudio, and Tidyverse: Let's start installing the software. The easiest way to get both pieces of software is to go to www.rstudio.com. When you get there, click the RStudio download link. You want the version of RStudio Desktop with the open-source license, so it's completely free. We'll start by installing R, even if you already have R installed.
Please make sure you have a recent version, so we'll follow the link to download R now. Just select the right version for your system, and the installation should go smoothly. Once that installation's finished, we head back to the RStudio page. Now select the installer that's right for your system for RStudio. Go ahead and install that; again, it should be pretty straightforward. Now I'll show you what RStudio looks like when you fire it up. When you first start RStudio, you should see something like this. We're almost, but not quite, done installing things. We have to install one more R package, called tidyverse. An R package is a program written in R that somebody else has already written and made available to us. To install it, we go to Tools and select the first item in the menu, Install Packages. I can just type the name of the package I want, tidyverse. Notice there's an autocomplete that I can activate by using the Tab key. Then I just click Install. This is what my machine looks like after the installation, but it's a little deceiving, because I already had tidyverse on my machine. You should expect to see much more verbose messages than I have on my screen. And there's just one more thing before the next lesson: I'd like to draw your attention to the Help tab in the lower right-hand quadrant. What you have here is a web browser within RStudio with searchable help. Some people prefer to use Google; some people prefer to use this help system. Use whatever you're comfortable with. Stay tuned for the next lesson, where we load our data. 3. Lesson 3: Importing the Data: Welcome to Lesson 3, where we import our data. Please start by downloading the data set from the attachment area in the class projects section of the course page. You want the emp_sim.csv file. And then, if you haven't done so already, fire up RStudio.
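If you prefer typing to menus, the package installation from Lesson 2 can also be done with one console command. This is a sketch, with a guard added so the command only installs when the package is missing:

```r
# Install the tidyverse package from CRAN (a one-time step per machine);
# equivalent to Tools > Install Packages > "tidyverse" in RStudio.
# The guard skips the download if tidyverse is already installed.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
```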
If you're jumping in in the middle and haven't yet installed RStudio, please go back to Lesson 2 for instructions. The easiest way to import a data set from your hard drive is from the Environment tab in the upper right-hand panel. From here we click Import Dataset. Notice that there are many different types of files that we could import here. Our CSV file is a text file, and we need to make sure we use the readr option, not the base option. Using readr will ensure that our data plays nicely with some of the functions that we use later. I need to browse to find the file on my hard disk. After we've selected our file, we get a convenient preview, and then we just click the Import button. Notice that several things have happened now. One is that RStudio has added a new tab in the upper left where you can view the data, just as you could in a spreadsheet. Another is that the variable name emp_sim has been added to the Environment tab. Any time we want to refer to this data set in code, that's the name we'll use. You can see this in the console in the lower left, where the View command was added for us: View, and then, in parentheses, emp_sim, the name of the data set we're going to view. If you're new to programming, notice this pattern; we're going to use it a lot. To execute a function, we use the name of the function, in this case View, and then, in parentheses, whatever information the function needs to do its job, in this case the name of the data set to view. Now I'm just going to get this View tab out of our way so we can look at the data some other ways. What I really want here is to load the tidyverse library. You might be thinking, wait a minute, didn't we just do that in the last lesson? No, what we did in the last lesson was install the library. That puts it on the hard disk, but we haven't yet made it part of our program.
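The import the menu performed can also be written out as code, roughly like this. It's a sketch only: a tiny stand-in file replaces the emp_sim.csv you downloaded from the class page.

```r
library(readr)   # readr provides read_csv()

# For this sketch we write a tiny stand-in file; in the course you
# would point read_csv() at the emp_sim.csv you downloaded.
tmp <- tempfile(fileext = ".csv")
writeLines(c("revenue,extrovert", "335,low", "482,high"), tmp)

# Read the CSV into a data frame named emp_sim, just as RStudio's
# Import Dataset menu does behind the scenes.
emp_sim <- read_csv(tmp)
```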
If we want to use the tidyverse as part of our program, we need to load the library; that's what this command, library(tidyverse), is doing. After you hit Enter, you'll see a few messages. Don't worry about them; they won't really affect us. The reason we wanted the tidyverse library is so we could use its glimpse function. That's another way to see the data. Again, the syntax is the name of the function, glimpse, and then the information it needs to do its job, in this case which data set we're glimpsing. Now we see our data in a very different way. It may not look as convenient as the grid-based view we had before, but this form has a couple of advantages. One is that each variable has its own row, so even if we have 40 variables in our data set, we'll still be able to see them all, unlike when they're arranged horizontally. The other is that with the glimpse function, we can look at the data types. The dbl stands for double; that essentially means they're numbers that can hold a lot of decimal places. The chr stands for character; that just means it's basically text to the computer, with no particular significance. If you stay with us for the next lesson, we'll start making some graphs. 4. Lesson 4: Graphing with Quickplots: To recap: in the prior lesson we loaded our data set, loaded the tidyverse library, and got a quick look at the data. Now we're going to start some graphing. But first, I'd like us to open up an R script file to work in, because that makes it easier to save our work and reuse parts of it later. We'll do that from the File menu, where we select R Script. I now have an untitled script window in which I can type commands. And before I forget, I'm going to go ahead and save that before the next screenshot. Okay, I just did my save off camera, and now we're ready to make a graph. The function we want for this is called qplot, short for quick plot, and it needs three pieces of information.
Those are: the variable that will be on the X axis, the variable that will be on the Y axis, and the data set we're using. Suppose we want to explore whether the revenue made by each salesperson depends on how extroverted they are. So the variables we want are extrovert on the X axis and revenue on the Y axis. Those variable names are the same ones we saw when we glimpsed the data in the prior lesson. Notice that we separate the different pieces of information, the different arguments, in other words, by commas, and that we have to put the X and Y in the right order. After the X and Y, there are several arguments that we may or may not use, and we need to specify those with keywords; in this case, data =, and then emp_sim, the data set we're actually using. I'll expect to see the output in the Plots tab, so let me switch to that. Unlike when we were working in the console down below, we need to actually tell RStudio to run the code we've typed. To do that, I'll highlight the line of code I want to run and then click the Run button, and we now have a graph. Unfortunately, we don't have a terribly illuminating graph. While we can see that the handful of highest revenue earners had high extroversion, we really can't tell much at all about the other 1400. We just see a big bunch of dots, so clustered together that we really can't tell what's going on. Fortunately, we're not stuck with that; we just need a different type of plot. So let's do a box plot instead. It looks very much like the same plot we just did, except that we need to specify geom = 'boxplot'. Notice the single quotes around boxplot. That's because boxplot isn't a variable; it's what we call a string. That means it's just supposed to be interpreted as text. Once again, I'll highlight the line and click Run, and here's our box plot. So what are we looking at? These horizontal lines in the middle indicate the median, not the average, but the median.
That means that half of the revenue values are above that line and half of the revenue values are below that line. The white boxes indicate the middle half of values, so the next 25% above the median and the next 25% below the median. Since those medians aren't really very far apart, we'd probably conclude that it doesn't make much difference whether our sales force is introverts or extroverts. If you'd like a little more detail on where the different numbers fall within those white boxes, we can do something called a violin plot. I'll cut and paste the previous line and change 'boxplot' to 'violin', and we run this one. Now we can see that the biggest clustering of our dots is very near where the median was before; we see the wide areas just above 200. There's one last thing I'd like to draw your attention to in this lesson. Notice the order of high, low, and medium: that's alphabetical order, which seems a little counterintuitive. We might have thought that low, medium, high would make more sense. That's because the extrovert variable has the character data type, which means low, med, and high are nothing more to the computer than a bunch of letters. To fix that, we'll need something called factors, which we'll cover in Lesson 7. 5. Lesson 5: Summary Statistics by Group: Welcome back. In our last lesson we used quick plots to explore the data set; in this lesson we'll be using the summarize function. The summarize function from the tidyverse library uses something kind of strange-looking called a pipe operator. The syntax for the pipe operator is percent sign, greater-than sign, percent sign: %>%. Its purpose is to take one variable, such as our data set, or the output from another function, and send it on to the next function. In other words, the contents of our data set are being supplied to the summarize function.
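The pipe just described can be sketched in a couple of lines. This is only a sketch: a tiny stand-in data frame replaces the course's emp_sim, and the two calls shown are equivalent ways of saying the same thing.

```r
library(dplyr)  # part of the tidyverse; provides %>% and summarize()

# A tiny stand-in for the course's emp_sim data set.
emp_sim <- data.frame(revenue = c(120, 335, 482, 510))

# The pipe sends emp_sim into summarize(); these two lines do the same thing.
emp_sim %>% summarize(avg = mean(revenue))
summarize(emp_sim, avg = mean(revenue))
```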
In our summarize function, we can create new variables based on that data set. In this case, I create the variable avg, and I set that average variable equal to the mean of the revenue. As usual, highlight the line and click Run. Even though we're working in the script file, our results still appear in the console in the lower left. So this avg label that we created has a type of dbl, double, our number type, as we expected, and it tells us that the mean revenue of the entire sample is 335. Besides the mean, there are several other functions that we can use with summarize; among them are median and count. As it turns out, though, I've made a little mistake in what I've typed here, but I'm going to leave it in for educational value rather than edit it out. In the console we see a message telling us that we were not supposed to supply any arguments to the n function; in other words, nothing was supposed to be in the parentheses, so I'll go back and edit that out. By the way, the function that we're using for count is called n because that's commonly used in statistics texts to indicate the number of items in the sample. Now R is much happier with us, giving us the values for the average, the median, and the count. You may notice that the type of the count is int; that stands for integer. R does that because it knows that we can't possibly have half an item, so it uses the integer data type for a count. Now I'm going to cut and paste the line, and we'll make one more very useful modification. You may have noticed that our previous summary didn't really seem that illuminating, because it didn't differentiate between different groups. That's what we're going to fix now. The tidyverse library we've been using also supplies us with something called the group_by function. I'm going to add it in between our initial data set and our summarize function.
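Put together, the grouped summary being built here looks roughly like this. It's a sketch: a small stand-in data frame replaces the course CSV, and the column name education is assumed to match the data set's educational-level variable.

```r
library(dplyr)

# Stand-in for emp_sim: education level and revenue for six employees.
emp_sim <- data.frame(
  education = c("college", "college", "high school",
                "high school", "college", "high school"),
  revenue   = c(480, 510, 210, 260, 390, 300)
)

# Group by education, then summarize each group separately.
emp_sim %>%
  group_by(education) %>%
  summarize(avg = mean(revenue), med = median(revenue), count = n())
```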
So we have our data set being piped into group_by, where we've told it to group by educational level, and then the result of grouping by education gets piped into our summarize function. Thanks to our group_by operator, we can now tell pretty quickly that there appears to be an advantage to hiring members of our sales force with a college degree. Stay tuned for more ways to look at different subsets of our data with the filter function. 6. Lesson 6: Custom Subsets with Filter: Welcome to Lesson 6, where we learn to look at certain subsets of our data using the filter function. Let's start by creating a subset that includes only those members of our sales force who had no prior sales experience when they were hired; in other words, prior sales equal to zero. First, I'm creating a variable called no_prior so that we can use it to refer to this subset again later, and I'm setting it equal to the expression that follows. What comes next looks a little bit like our previous lesson: we see the data set piped to another function, in this case called filter. The argument to the filter function, prior_sales == 0, is a description of which items we want to keep. If you're a programming newbie, you probably think I typed something wrong; you're looking at that double equal sign and wondering why in the world there are two of them. The difference is that the single equal sign we used after no_prior causes the variable no_prior to take the value that follows, while the double equal sign tests whether something is already true: is prior_sales already equal to zero for that particular item? So now we run the line as usual. But when we look for our results in the console, they don't look very interesting. Well, all that happened was exactly what we told the computer to do: create a variable no_prior and assign a value to it. But we haven't actually looked at that variable yet; we just assigned to it. I can fix that pretty quickly.
If I go back and simply type in the name of the variable and run it, then I'll find out what's actually in that variable. All right, much more interesting now. Now we see a partial list of our data set, and only the part of the data set for which prior_sales equals zero, along with a notation that there are over 400 additional rows in the data set to see. Now, suppose we want to view the sets with our top 10% and bottom 10% revenue earners. First, we need to find the cutoff points for our top 10% and bottom 10%. For that we'll use the quantile function. I'm creating a variable q10 to hold the cutoff for the bottom 10%. The 0.1 is the decimal form of 10%; it's where I want the cutoff to be. What might look odd is this dollar sign. Remember that we did not pipe the data set into this function this time. That means the quantile function does not automatically know that we're using the emp_sim data set, and so we have to specify that: emp_sim$revenue means we want the revenue variable from the emp_sim data set. Here I do the same thing for the 90th percentile. Now I'd like to view the values of q10 and q90 to see where those cutoffs are. Now I'm going to do something convenient you haven't seen yet: I'm going to highlight all four lines that we just typed and run them all at once. Now we can see that the bottom 10% of our employees had revenue less than 206, while the top 10% had revenue greater than 482. Now I create a variable bottom_10 to hold the subset of people whose revenues were in the bottom 10%, that is, less than q10, and I'll repeat for the top_10 subset, whose revenue is greater than q90. To compare the two groups, I'm going to use the summary function, not to be confused with the summarize function that we saw earlier. For this I simply need summary and then which data set I wish to summarize. I'll repeat for the top_10 and then select the four lines that I want to highlight.
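The lines just described might look something like this. It's a sketch: random stand-in data replaces the course CSV, while the variable names q10, q90, bottom_10, and top_10 follow the lesson.

```r
library(dplyr)

# Random stand-in for the course's emp_sim data set.
set.seed(1)
emp_sim <- data.frame(revenue = round(runif(100, 100, 600)))

# Cutoffs for the bottom and top 10% of revenue.
q10 <- quantile(emp_sim$revenue, 0.1)
q90 <- quantile(emp_sim$revenue, 0.9)

# Subsets below and above those cutoffs.
bottom_10 <- emp_sim %>% filter(revenue < q10)
top_10    <- emp_sim %>% filter(revenue > q90)

summary(bottom_10)   # compare the two groups
summary(top_10)
```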
Now we have our summaries, and we can compare the top 10% and the bottom 10% according to various variables. For example, we can see that even in the first month of employment, our best employees were somewhat noticeably better than our weakest 10% of employees. But you also may notice something less helpful about this summary: notice how we have no useful information whatsoever for any of our variables that had a data type of character. In the next lesson we'll see how to use factors to fix that. 7. Lesson 7: Turning Characters to Factors: Welcome to Lesson 7, where we learn about factors, our way to fix some of the things that don't quite work as we'd like with the character data type. We'll use the extrovert variable, which can have values of low, medium, or high in the character type, as our example. I'm going to create a variable called ext_levels that's going to hold the strings that I want to be the labels. That c is for combine: what I'm doing is creating a list of the levels that I want that variable to have, in the order I want them, low, medium, high, rather than the alphabetical order that the character data type uses. To make use of those new levels, I'm going to create a new column, a new variable, in our emp_sim data set called ext_factors. I add that new variable to the data set by using the mutate function: I pipe emp_sim into it, specify what I want my new variable to be called, in this case ext_factors, and then show how it's constructed. In this case, I'm going to use the factor function to convert the character type to the factor type, so I specify which variable is being converted, in this case extrovert, and then specify what the levels will be, referring back to the ext_levels variable that I defined a moment ago. Also note that I have started this line with emp_sim =; that's because I want to make a permanent change in the emp_sim data set to reflect this new column.
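Put together, the lines being described might look like this. It's a sketch with a small stand-in data set; the exact level strings ("low", "med", "high") are assumptions based on what glimpse showed earlier.

```r
library(dplyr)

# Stand-in for emp_sim, with extrovert stored as plain character text.
emp_sim <- data.frame(
  extrovert = c("high", "low", "med", "med", "high"),
  revenue   = c(480, 210, 330, 350, 500)
)

# The levels we want, in the order we want them (c is for combine).
ext_levels <- c("low", "med", "high")

# Add a factor version of extrovert as a new column, and assign the
# result back into emp_sim so the change is permanent.
emp_sim <- emp_sim %>%
  mutate(ext_factors = factor(extrovert, levels = ext_levels))
```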
Now we'll use the glimpse function to see how things have changed, so I'll highlight and run the last three lines of the script. Now we can see that our data set includes the new column and that it is labeled with the fct, or factor, data type instead of the character data type our original extrovert column had. We can also see that there are no longer quotes around low, medium, and high in the factors. That tells us that those are factors, and not just ordinary strings, or in other words, ordinary pieces of text with no real significance to the computer. To see how this has fixed some things, let's redo our bottom_10 filter from earlier. Since the underlying data set has changed, we need to rerun that filter to get a new set for bottom_10, and let's copy and paste that summary line as well, then highlight and run these lines to see what's changed. Remember that really useless information in our summary of the extrovert variable? The factor version of the variable, however, gives us useful information in the summary: we can see the distribution of low, medium, and high extrovert status among the employees in the bottom 10% of our sales force. We can see another benefit of factors if we go back and rerun our qplot line, except that this time I'll use ext_factors on the X axis instead of extrovert. Now our labels are displayed in the more intuitive order, low, medium, then high, as we specified in the factor levels. This happens because they're now factors with specified levels, as opposed to arbitrary character text that has to be ordered alphabetically. 8. Lesson 8: Reporting with R Markdown: Welcome to Lesson 8. We've drawn all these great conclusions about our data in the first seven lessons; now what do we do to report our findings? For this, let's look at R Markdown. I'll start by creating an R Markdown file. Don't stress too much over the title of the report; you'll be able to change that later. I'm selecting Microsoft Word format here.
Notice that to produce a PDF you need to do a little extra software installation. I never worry too much about whether I have that software, because I can always create a PDF from within Microsoft Word if I wish. And now, in the upper left, we have a new R Markdown file with a lot of things filled in for us. There's the title of the document that we filled in a moment ago; you can change that if you wish. This line contains some global settings for how R Markdown will work for us. I'm going to make one change here to our output by adding message = FALSE; notice FALSE in all capital letters. That keeps a lot of uninteresting messages out of our final report. In markdown of any flavor, R or otherwise, two hash symbols in a row is the symbol used for a headline, so I'm going to change the headline to something that matches our content better. By the way, other heading levels come from using different numbers of hash symbols. And I'll replace this uninteresting example text with something more useful for our purposes. Our code goes within these code blocks here that are marked in gray; they're marked at the beginning and the end by three backticks. The backtick is the key in the upper left of your keyboard. I also need the braces with the r inside them to indicate that it's an R code block. But I don't need the 'cars'; that's just an example chunk name, so I'm going to remove that. I'll also delete the existing example code. Now we have to add our own code, including some things that wouldn't necessarily be obvious. Fortunately, even the less obvious things are all available through the History tab in the upper right. In the History tab is a list of the last 500 commands that I've typed. Anything we've used as we've explored our data that we would now like to use in our report will be available somewhere in that panel.
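The relevant parts of the generated file look roughly like this after the edits just described. This is a sketch of RStudio's default template; the exact boilerplate varies by RStudio version, and the title and headline text here are placeholders, not the course's.

````markdown
---
title: "Sales Force Analysis"
output: word_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
```

## Extroversion and Revenue

```{r}
# R code pulled from the History tab goes here.
```
````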
As we're assembling this R code block, we need to keep in mind that the R Markdown script has no memory of what happened before, so everything we've done that's relevant to our analysis has to be included within that block, starting with loading the data set. And that means we need to load the readr library, which is used to load the CSV file. So I'll select the line where we load the readr library, and I can move it into our R Markdown file by clicking To Source. And there it is. And then the line that loaded the data set from our hard disk and set it equal to the emp_sim variable. Now, you may be thinking to yourself, wait a minute, I've watched this entire course and I don't remember ever typing that. You're right, you didn't. Those lines were generated behind the scenes when we used the Import Dataset function from the menu. So even though we didn't type that line with our own fingers, it is here for us to use in the History. Since we had to reload the readr library, it may occur to you that we'll also need to reload the tidyverse library. You'd be absolutely right, so we'll do that now. We'll include the graph that shows why we concluded that extrovert status was not terribly important. But that didn't look right until we did the transformation into factors, so I'll include that first, and then the code to generate the plot. And let me get rid of all this extra pre-inserted text below the code block. The command that will actually create the report from our R Markdown file is the Knit command. Unsurprisingly, you'll be asked to supply a file name and where you'd like to save the file, and in a moment I have my Word file. It even opened automatically for me. We see the title we chose at the top, then the headline, the regular text, then the R code we were using, and then the graph generated by the qplot. Our R code block could also contain requests for summaries or something like that.
We would then see that output in the report as well. Now, depending on who your audience is, you might or might not want all that R code that generated the graph. If we don't want the reader to see it, we just have to change one setting: it's the echo = TRUE that we saw in our global settings. The echo is what tells R Markdown to go ahead and include the R code in the output. Since I don't want that, I'll change the TRUE to FALSE, knit again, and there it is, without the R code. 9. Lesson 9: Recap and Project: So before you start your project, let's remind ourselves what we did. After we load the data set, don't forget to load the tidyverse library; if you don't do that, much of what we did in this course won't work. To get a quick look at the available variables and data types, we can use the glimpse function. To see a graph, we can use qplot, with scatter plots, box plots, or violin plots. group_by and summarize let us do things like view the mean revenue for different subgroups, such as education or extrovert levels, and filter lets us look at particular sets of our data, such as the top 10% and bottom 10%. And don't forget: when those pesky character data types are making it hard to see things the way you want, you can turn them into factors. For your class project, I've given you another data set very much like the one we just used; it's called project.csv. Try to answer some questions about your employees with that data using the techniques we've learned in this course, then use the R Markdown feature that we saw in Lesson 8 to create a document with your conclusions. I encourage you to post them to the class gallery. Thank you for participating in this course.
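As a parting cheat sheet, the whole workflow recapped above can be sketched in one short script. It's a sketch under stated assumptions: a stand-in data frame replaces the project CSV, and the column names mirror the course data set rather than the project one.

```r
library(tidyverse)  # loads readr, dplyr, ggplot2 (qplot), and more

# Stand-in data; for the project you would load your CSV with read_csv().
emp_sim <- data.frame(
  extrovert = c("low", "med", "high", "med", "low", "high"),
  education = c("college", "college", "high school",
                "high school", "college", "high school"),
  revenue   = c(210, 335, 480, 300, 260, 510)
)

glimpse(emp_sim)                                  # variables and data types

emp_sim <- emp_sim %>%                            # character -> factor
  mutate(ext_factors = factor(extrovert, levels = c("low", "med", "high")))

emp_sim %>%                                       # summary stats by group
  group_by(education) %>%
  summarize(avg = mean(revenue), count = n())

top_sellers <- emp_sim %>% filter(revenue > 400)  # custom subset

qplot(ext_factors, revenue, data = emp_sim, geom = "boxplot")
```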