Scrape Reddit Comments and find Reddits using R | Mark Gingrass | Skillshare

Scrape Reddit Comments and find Reddits using R

Mark Gingrass, Citizen Data Science

4 Lessons (27m)
    • 1. Scrape Reddit Intro (1:19)
    • 2. Download R Condensed (1:15)
    • 3. Install R and R Studio on Windows (9:34)
    • 4. Scrape Reddit Comments R ExtractoR (14:29)


About This Class

Have you ever thought the Reddit comments were hilarious or insightful? Do you want to copy them for use elsewhere? Now you can.

Learn how to scrape Reddit comments using R and the RedditExtractoR package. All software is free. I show you how to scrape any particular Reddit thread, as well as how to find and sort Reddit threads based on search terms.

Lesson 2 is a very fast, to-the-point walkthrough of how to download R and RStudio. If you have them installed already, you can skip it.

Lesson 3 is a more in-depth video on how to install R and RStudio and run your first script. If you know how to do this already, skip to lesson 4.

Download R: https://www.r-project.org/

Download R Studio: https://rstudio.com/products/rstudio/download/

Transcripts

1. Scrape Reddit Intro: In this project you're going to be scraping Reddit using R. First, you're going to scrape the comments of any particular Reddit thread that you find on reddit.com. If you haven't heard of it, you'll find the comments to be hilarious, and sometimes you might want to scrape them. You can scrape the comments and use them for further analysis, or put them into a text-to-speech synthesizer, something like that. So you're going to learn how to search Reddit using a programmatic approach with R and RStudio, which are free. If you have R and RStudio already installed, you can skip the first video. If not, please watch the first video and the following video. You do not need to be a programmer or a hard-core mathematician to do this. It's very simple. I'll show you how to scrape Reddit, and I'll show you how to find Reddit posts programmatically as well. So let's say you want to find posts that have "Trump" somewhere in the title. I will find all of those posts with R that meet a certain threshold, for example, only the posts that happened today and that have over 20 comments. You're going to do all that within a 15-minute video, so stay tuned and enjoy. Your project is to just scrape the content that you want and show it off.

2. Download R Condensed: Let's get started. First, let's download R, then we'll download RStudio, and we'll go from there. Go to r-project.org, click on the "download R" link, and find a mirror site that's close to you. The USA is close to me, so I click on that. Then click on your platform of choice. This works for Linux, Mac, and Windows. I'm on a Mac, so I'll click on the Mac link, and then you should see a page that looks something like this. You need to find the download link. Usually, this .pkg file is the latest release.
Click on that and it'll download the file you need to install R. Install R and then come back. Once you're done installing R, go to RStudio, which is rstudio.com, or you can do a quick Google search for "download RStudio"; the first result that pops up is rstudio.com/products, et cetera. You probably want to download the free version here, RStudio Desktop. So click on that and follow the download steps, which cover Windows, Mac, and Linux as well. They're simple download steps. Get that one started and you should be ready to go.

3. Install R and R Studio on Windows: We're going to download R and RStudio and install them both on this fresh, clean virtual machine of Microsoft Windows. I have Microsoft Edge down here. We're going to open that up, and instead of typing in the URL, we'll just go to google.com, or whatever search engine you want to use, and type in "download R". The first result should be the R Project, r-project.org. Click on that. This site hasn't changed in years, for the most part, so the top part looks very similar, and it has a link here to download R. Click on "download R" and go down to a mirror site that's close to your location. I'm in the USA, so I'm going to click on that. Finally, you have a prompt to pick your platform of choice: Linux, Mac, or Windows. I'm going to download this for Windows. I'm running Windows on top of my MacBook in a virtual machine. Once you click on that, you get to this page. It tells you to install R for the first time; click here and you finally get to the download link right here: "Download R 3.6.1", at the time of recording. The latest stable release should be the one that it gives you by default. There are developer versions as well that you can download if you're interested. So I clicked on that.
With this, I'm just going to choose "save as" and put it on the desktop so that we can find it easily. Now that's downloading, and instead of closing this, let's download RStudio while we're at it. We're not going to install it yet, though. So again, just go to google.com and type in "download RStudio". Very simple. And it's rstudio.com. I like to use Google because sometimes when I tell people to go to RStudio, they might misspell it or something like that. But you can go directly to the link if you want. Most likely you're going to want this free version, RStudio Desktop. When you click on download here, it will bring you down to installers for supported platforms. As you can see, it lists Windows, Mac, and Ubuntu Linux. I'm again using a Windows machine in a virtual box, so it's the very top one. We're going to download that. Let's choose "save as" and put this on the desktop as well. You really want to install R first, because RStudio relies on R being installed. So here's my R version 3.6.1 on the desktop. Let's install that while RStudio is finishing up its download. Of course, I'm going to click yes, I want to install this app. Pick your language and run through the steps. Read all the pages for fun and pick your location. You probably want to install both the 32-bit and the 64-bit files, because certain applications might be 32-bit versus 64-bit, and there are occasions where you want to swap which version you use. We're not going to get into that here, but a lot of government applications, for example, are still using 32-bit versions of Microsoft Access, and sometimes with R you want to access Access. In order to do that, you need the 32-bit version. I don't want to customize. And here we go.
We're just going to go through the menu system like any other installation. I'm sure you're all familiar with that. We're going to load this up. We're not even going to test it out. Honestly, once it's done installing, we're just going to install RStudio, and RStudio is going to point to this version of R as its back end to actually run the scripts. You can also download previous versions of R. For example, if your work only has a 2.x version, you can download that as well as a 3.5, and within RStudio you can set which version of R to use. Okay, R is all set up, and now we're going to set up RStudio. Here is the other installer for that. Again, it's a similar guide. The default locations are usually what I use. Once in a while you might want to change that to, say, an external hard drive or an SSD to save space on your local machine, but for the most part using the defaults is all I'm going to do. I'm going to run this, and when it's done I will open it up and show you a couple of quick settings. Then we'll run our first script just to show you that it works. So let's click on the Start menu and type in "RS" for RStudio. You can see it pops right up here. I'm also going to right-click on it and pin it to my taskbar, because that's where I like to put things that I use a lot. So now it's down here. I'm going to click on R. Now, that's not R as in R version 3.6. This is RStudio, and it just happens to have the R as its icon. So here we go, we have RStudio set up. Let me just explain: this is the console. This area here is your environment; when you store variables, they'll show up there. And down here is a nice little viewer for files and plots, and for installing packages. You can do all that down in this area. And, of course, every pane can be resized.
This right here is the console, so we can do something like print and type in "Hello world" here. Hit Enter, and as you can see, output line number one says "Hello world". This is not an R script. This is just the R console. Let's go to Tools and then Global Options, and I'll show you a couple of quick things. Remember I said you can have different versions of R? Here is where you would change that. If you downloaded R version 2.x or a different version, for whatever reason, it would be changed right here. Finally, we're going to go to Appearance, because I like to set a different font size, especially when I'm tutoring online. So I'm going to set that to be a little bit bigger. You can also pick a color theme, so whatever you choose, Eclipse or whatever. Let's just do a dark theme here and apply it. It asks, "Do you want to restart?" Yes. It's usually really, really quick. This is pretty good, because it's running in a virtual machine on top of my MacBook, which is a 2017 MacBook Pro. So now I have my new colors. Let's start a new project: File, New Project. This way we can contain all of our stuff in one directory. I'm going to put it in a new directory, and we're going to choose New Project. We're going to keep a simple directory name. I'm going to call it "first R project" to make it nice and simple. Now I'm going to click Browse just to see where the default location is. It looks like the default is under your Documents. That's perfectly fine for me. I'm going to keep that and click on Create Project, and you'll see only a slight change. Your files over here are now under that project. See, it's under Home, first R project, right here. You can see the directory structure. Let's go ahead and create a new script. This is your first R script: File, New File, R Script.
There are a lot of options, and in other tutorials, or further down the line with this tutorial, I'll show you and explain a lot of these other features. Let's click on R Script. This is our very first script, and we're going to do the same thing: we'll say print, and in parentheses, "Hello world". And the world never replies back, so that's really awkward. Now, there are two ways to run this code. If you're on a MacBook, you can hit Command+Enter, and if you're on Windows, I believe it's Control+Enter. Control+Enter here, and you can see down here on output line number one, "Hello world". So you know it works. Everything is fine, and that's it. Now you can save this file. Let's click on Save, Control+S, and I'm going to call it "first script". That's going to automatically default to the project directory, which is right here: first script, first R project. It's all accessible through here, which is nice. If you had plots, they would be here. We'll get into packages a little bit later, but to update packages or to install them, you can click on Install. It makes it very, very simple. And your global environment, again, is for things you have in your RAM or your environment. For example, this variable, variable a: I'm going to assign it a number, then I'm going to hit Control+Enter, and you'll see it shows up on my right-hand side, because it's stored in memory. It's using actual space, and here it is. To clean that up, you can delete the variables by clicking on this broom and clicking Yes, and then you can start all over. That's one way to do it. Click on Run and it will run the entire thing. I thought there was a way to run the whole thing, but I guess highlighting and hitting Run is good enough for now. Well, that's it. I wanted to keep this short and simple for a Windows or a Mac user to download and install R. If you have any questions, put them in the comments below or send me a message.
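The console and script steps from this lesson can be sketched in a few lines of R. This is only an illustration: the value 42 is made up, and the `rm()` call is the scripted counterpart of clicking the broom icon in the Environment pane.

```r
# First script: print a greeting, as typed in the console and the script file.
print("Hello world")

# Assign a number to a variable; it then appears in the Environment pane.
a <- 42  # illustrative value, not taken from the video
print(a)

# Scripted equivalent of the broom icon: remove every object from the
# global environment so you can start over.
rm(list = ls())
```

To run a single line in RStudio, place the cursor on it and press Control+Enter (Command+Enter on a Mac); to run several lines, highlight them first.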
If you have any tips or tricks, put those down there as well. I'd love to hear from you. So stay tuned; there are plenty more exercises and tutorials to go.

4. Scrape Reddit Comments R ExtractoR: Hey, welcome back, everybody. This is Mark Gingrass here, and I'm at a website called reddit.com. If you have not heard of it, you should probably check it out, because it's pretty awesome. Anyway, what I want to show you today is how to use R and a package called RedditExtractoR to extract comments and to find URLs that have specific terms in them. And what better way to do it than to give you a practical lesson on how to do it in real time here. So let's start with the easy way. We're going to find a Reddit thread that we want to check out. I've got quite a few here. Let's just click on one, just randomly picking. So this is AskMenOver30. I love this subreddit. Let's click on this particular one here, and I'm going to grab that URL, so I'm going to Command+C that. As you can see, there are comments, but there are not that many comments for this one. That's okay, because part of what I'm going to show you is a way to find a thread that has more comments. We'll do that programmatically as well. So this is a really good way to do this. Let's go back to R. I've got a little bit of a shell already built here. We're going to skip some of it. Okay, so you'll be using the tidyverse and the RedditExtractoR packages. If you don't have those two installed, click on Install and type in tidyverse, which you should have if you followed along with any of my previous exercises, and then RedditExtractoR. Just click on Install, let it do its magic, and you're good to go. So RedditExtractoR and tidyverse should be installed. Do that, then you can load these libraries. Let's load those two libraries. We're going to skip lines four through eight.
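Installing and loading the two packages can also be done from a script instead of the Install button. The sketch below is one possible way, not the script from the video: `missing_packages()` is a hypothetical helper, and the install step is guarded by `interactive()` so nothing is downloaded when the file is run non-interactively. Note that the functions used later in this lesson (`reddit_content`, `reddit_urls`) belong to the package version shown in the video; a newer release installed from CRAN may expose different function names.

```r
# Packages used in this lesson.
needed <- c("tidyverse", "RedditExtractoR")

# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Only install and load when working interactively (for example in RStudio).
if (interactive()) {
  to_install <- missing_packages(needed)
  if (length(to_install) > 0) install.packages(to_install)
  library(tidyverse)
  library(RedditExtractoR)
}
```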
I know it's hard to see, but we're going to skip this for now. I'm going to show you in a minute what I'm going to do with that, because we already have a URL. We're also going to skip line 10, and by skip, I mean I'm literally just not going to run those. You should comment them out if you're never going to run them, but we'll leave that there for a second as a placeholder. Now what I want to show you is this function called reddit_content. reddit_content comes from RedditExtractoR, and all you have to do, and this is amazingly simple, is put the URL in quotes. So add double quotes on both sides, then Control+V or Command+V to paste that URL in there. See, it's in there now. So on line 12, for me, I hit Command+Enter and you'll see how fast this is. Down at the bottom, you see it's 100 percent done. What it did was extract those six comments, so there weren't that many. Under content, I click on it, and what you see is that the subreddit is called AskMenOver30. You've got the comment date, and you have the structure and a few other things in there. Let's go to the right and you'll see the actual comments over here. So you have all your comments. We've just literally extracted the comments and the URLs to go with them. What I want to do is write those to a CSV file so that we can maybe cut and paste them into, I don't know, a text-to-speech synthesizer, put that in a YouTube video, and make money off of it. Something like that, right? Who knows. So this write.csv does just that. I really only want the comments, so I'm going to use content, dollar sign, comment (content$comment), because that's the name of the feature, the header name that you see. I'm going to call it test.csv, and I don't need row names. They don't mean anything to me. So let's write that, and we go to our files here, and you see test.csv.
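The write step can be tried without touching Reddit at all. In the sketch below, `content` is a small made-up stand-in for the data frame that `reddit_content()` returns; only the `comment` column name comes from the video. `save_comments()` is a hypothetical helper, and it wraps the comments in a one-column data frame so the CSV gets a readable header, which is a small departure from writing `content$comment` directly.

```r
# Stand-in for the data frame returned by reddit_content(); the rows are
# made up and only the `comment` column name is taken from the video.
content <- data.frame(
  subreddit = c("AskMenOver30", "AskMenOver30"),
  comment   = c("First *example* comment", "Second comment, with a comma"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the comment text and write it to a CSV
# file without row names.
save_comments <- function(content, path) {
  write.csv(data.frame(comment = content$comment), path, row.names = FALSE)
  invisible(path)
}

out_file <- file.path(tempdir(), "test.csv")
save_comments(content, out_file)

# Read the file back to confirm that each comment survived the round trip.
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)
print(round_trip$comment)

# With the package version used in the video, the real data would come from:
# content <- reddit_content("<thread URL pasted from the browser>")
```

Because `write.csv` quotes text fields, commas inside a comment do not break the file.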
You can go find this in your directory, if you like, as well, and I'm just going to view it right now. So here they are. Here are the comments, which is kind of cool, because it actually captures the Markdown as well. That's pretty interesting. Now, there are ways to programmatically strip the Markdown, and there are ways to automatically render the Markdown. There are all kinds of things you could do with this, but you've got the general idea here. What you can do is copy and paste this into a text-to-speech synthesizer, make that into an MP3, throw it on some sort of video, and put it on YouTube, right? So the idea is to find a way to scale something so that I can always have YouTube content that I don't have to actually produce, because there's only so much time in the day. Just an idea. There's already tons of that out there. People do it with Python quite a bit, honestly. So anyway, that's one of many, many ideas you could get. So we have that information there. Now, that's cool, because we already had the URL that we were looking for. What I want to show you next is this function here called reddit_urls, and it has a bunch of different parameters. It has more parameters than what's shown here. What I want to do is add a search term, so I'm going to add the search term "trump" in there. I don't know if it's case sensitive, so I guess we'll find out. Then cn_threshold is your comment threshold. I want it to return any URLs that have the word "trump" in them. In Reddit, the way they build the URL has to do with the title. So if the title has "Trump", it's going to pull it back and give it to me, and with the threshold, a post has to have at least 20 comments to come back. And the page threshold, don't worry so much about that, but that's just the amount of results you want. You can probably get as many URLs as you want.
But when you get them, you can't extract all the data at once. There are limits to how much you can extract at one time. I'll get more into that in a second. Let's run this. We've got the word "trump" in there. Hit Enter to see if we even get any results. So it looks like links... this might be from a previous time I ran it, because you saw this little stop sign down here, but it's gone now. That means it ran. So, 92 observations of five variables. Let's click on that. Now I have links, so I have a title. See, it has "trump" in it. It looks like it's not case sensitive, which is good, because there's capital "Trump", and here's "trump" here in lowercase. And you have all the URLs, and that's what I'm really after, the URL. Now, this might be useful in itself: it has the number of comments and the title, which is kind of cool. You can sort these by comments if you click on that. So there are 23,000 comments on this particular one, with this particular URL. You can see how this could be beneficial, depending on what you're doing. It tells you the subreddit. It's under a subreddit called AskReddit. So, pretty handy information here. Let's take what we found here, this knowledge, and bring the comments in based on maybe the top comments, something like that, right? Let's see if we can do that. It looks like our row numbers are over here, but I'm not sure that's really true. There's a 252 and there are only 92 observations, so that's not a true statement. Disregard these numbers over here. You've got to be careful with that, because this is a package and we don't know what it's really doing. What it probably does is go through every URL and discard the ones that have fewer than 20 comments, but it kept the row numbers as if it hadn't discarded them. That's a very subtle thing that's hard to see.
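The row-number mismatch described above is ordinary data frame behaviour and can be reproduced with made-up data. The sketch below assumes, as the video guesses, that the package filters rows after collecting them; `all_links` and its values are invented for illustration, and the commented call at the end uses the argument names from the video with an illustrative `page_threshold`.

```r
# Stand-in for search results before any filtering (made-up rows).
all_links <- data.frame(
  num_comments = c(5, 250, 12, 23000, 40),
  title = c("post a", "Trump post b", "post c", "trump post d", "post e"),
  stringsAsFactors = FALSE
)

# Keep only posts with at least 20 comments, as a comment threshold would.
links <- all_links[all_links$num_comments >= 20, ]

nrow(links)                      # number of observations after the filter
labels_before <- rownames(links) # original positions survive as row labels
print(labels_before)

# Renumber the rows so that the labels match the positions again.
rownames(links) <- NULL
print(rownames(links))

# With the package version used in the video, the search itself would be:
# links <- reddit_urls(search_terms = "trump", cn_threshold = 20,
#                      page_threshold = 1)
```

After the reset, row 2 really is the second observation, which makes indexing such as `links$title[2]` less error-prone.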
But just so you know, these row numbers are not the actual observation numbers, so be careful with that. Let's close this out and go down here where I have this other content line. I'm going to comment this one out, which shouldn't matter if you run these in a certain order, but I don't want you to get confused. So this particular way right here was for when we already had the exact URL. But now we can extract the URLs this way, and we can actually plug in, hey, I want the fifth URL, or the tenth row, and do it that way. It's going to do the exact same thing as if I had copied and pasted it, but we get to choose from what we just picked up here. So maybe we want the max, right? Maybe we want to find the one with the most comments and extract those. So how can we do that? Let's see if we can do this here. We could use a max function, and I know we can get that by bringing in links, dollar sign, num_comments (links$num_comments). This will give us the number, though, not the actual index number. I still don't know what row that is, right? So that's also a problem. What I can do is use the tidyverse. This is why we have it. I'm going to bring in the links data and pipe it over to a filter function. I would say I want num_comments to be equal to the max of num_comments. And if I hit Command+Enter on that, you get the actual URL. The URL just showed up down here, so that's what we want. So we're going to say max_url is equal to that. This is just on the fly. I don't know, let's try it. max_url. So I'm going to Command+C on that, and instead of this particular URL right here, where it was just grabbing the first one, I want to leave that out and Command+V in there. So I have max_url in there. And the wait_time of two is where...
In the API for Reddit, they only let you extract a certain amount of comments per minute, or per two minutes, or per whatever. The default is two minutes, and that's the least amount, so you can grab a chunk of comments, wait two minutes, grab another chunk, wait two minutes. That way, not everyone is pulling all of the comments at once, bogging down the system, and maybe getting unfair advantages or something like that. So let's run this with Command+Enter. Error: invalid URL parameter. I knew that was going to happen. So max_url, because of the way this is set up in the tidyverse, is actually a data frame itself. As you can see right here, max_url, if I click on it, you'll see it's a data frame with one observation. But we're almost there. We know we want max_url and the only observation in there, the first one. Boom, there we go. Command+Enter. Uh-oh, a double whammy. Dollar sign URL, first URL. Here we go. Well, I'm glad I could solve those pretty quickly, but you're going to run into that quite a bit. Don't be discouraged; it happens to everybody. So I have content now: 400 observations of 18 variables. Let's click on content. Based on that URL, which you can see over here on the right-hand side, I've grabbed all the comments, and you have all these different features: the comment date, the number of comments, things like that. The number of comments is the same in every row because we grabbed the max, which was 23,000-something. There it is. And over here we have a feature called comment, and we have all of the comments. Now, I'm only interested in the comments, so I'm going to extract those and put them into a CSV file. So that's what I have here.
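The "invalid URL" error and its fix can be reproduced on a made-up `links` data frame. The sketch below follows the steps narrated above (filter to the row with the most comments, then pull out the URL string) and adds a base R alternative with `which.max()` that is not shown in the video. The example URLs are placeholders. One more hedge: as far as I recall, the package documentation describes `wait_time` in seconds rather than minutes, so it is worth checking the help page for the version you have installed.

```r
library(dplyr)

# Made-up stand-in for the search results; column names follow the video.
links <- data.frame(
  num_comments = c(250, 23000, 40),
  URL = c("https://example.com/a", "https://example.com/b",
          "https://example.com/c"),
  stringsAsFactors = FALSE
)

# max() returns the largest comment count, not the row it belongs to.
max(links$num_comments)

# filter() keeps the matching row, but the result is still a data frame,
# which is why passing it straight to reddit_content() fails.
max_url_df <- links %>% filter(num_comments == max(num_comments))
is.data.frame(max_url_df)

# Pull out the URL column and take its first element to get a plain string.
max_url <- max_url_df$URL[1]
print(max_url)

# Base R alternative (not from the video): index of the largest value.
max_url_base <- links$URL[which.max(links$num_comments)]

# With the package version used in the video, the next step would be:
# content <- reddit_content(max_url, wait_time = 2)
```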
So I have my content that I just pulled all the comments into, and I only want the comments, so the dollar sign comment will get that feature, and I want to write it to a file called test.csv, and I don't need the row names. I'm going to hit Command+Enter on that. Normally you wouldn't quite do it like this with the comments, because it's not really comma-separated values in this case, I don't think. Let's click on test.csv to see what it wrote. Well, it's more like, every time a comment ends, there's a return character in this particular case, so it's separated in the sense that each comment is on its own line, so to speak. And what you'll notice somewhere, maybe, is, ah, yes, like this one right here. That's Markdown. So it actually keeps the Markdown. Now, there are ways to strip out Markdown and convert it to plain text. There are ways to do all kinds of things with this. But this is the actual comment somebody put on Reddit, with that markup, with brackets, with parentheses, with URLs. Reddit allows those inside of the comments. So we've captured that, which is actually pretty awesome data to have. But it might not make sense if you run it through a text-to-speech synthesizer, put it on YouTube, and try to monetize other people's content, hopefully non-copyrighted or something like that. So that's the idea. Again, let me run through this one more time. We have search_terms equals "trump". This is to get the links that have those terms in their titles. We didn't have to do that. We could have plugged in, down here on line 14, the actual URL of the Reddit thread that we're interested in, and then you wouldn't use this line here. But we wanted the one with the max amount of comments. We wanted that particular story. Now, there are other arguments to reddit_urls that you could have put in here. For example,
I could put in, comma, sort_by equals "new", and that would just give us today's posts. Let's see. Right now links has 92 observations. Let's run this one more time, the links line, so anywhere in here, just Command+Enter. And now I have nothing. So there is nothing new. I don't know what the threshold is for "new" on this, so we'd have to look that up. You could type a question mark followed by RedditExtractoR, and then you'd have to find out through the documentation what "new" means. So "new" is definitely not going to get us far with this. New could be within the hour, or it could be within a day. I don't know. It's a little bit vague, but for now, just know that there are different thresholds and different parameters you can set. And that's it. It's pretty cool. Now do what you want with those, expand on this, and try to automate more of your daily life. That's what I'm going to do. I'm going to take these comments and see if I can feed them into, like, a Final Cut Pro plug-in, maybe, and automate comments. They're already out there, so I know people are doing it, but now you can do it just like that.
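One quick way to see which parameters a function accepts, alongside reading its help page, is to list its formal arguments. The sketch below uses a hypothetical helper, `arg_names()`, and demonstrates it on a made-up function, `demo_search()`, so it runs without the package installed; the commented lines show how the same idea might be applied to `reddit_urls` in the package version from the video.

```r
# Hypothetical helper: list the argument names of a function.
arg_names <- function(f) names(formals(f))

# Made-up function used only to demonstrate the helper; its arguments
# mirror the ones mentioned in the video.
demo_search <- function(search_terms, cn_threshold = 1, sort_by = "comments") {
  NULL
}
print(arg_names(demo_search))

# With RedditExtractoR loaded (the version used in the video), one might try:
# ?reddit_urls
# arg_names(reddit_urls)
# links <- reddit_urls(search_terms = "trump", cn_threshold = 20,
#                      sort_by = "new")
```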