Natural Language Processing with Python

Data Science Rebalanced, Data Scientists

Lessons in This Class

1. Class Trailer (1:47)
2. What is NLP? (2:10)
3. Course Overview and Tools (1:37)
4. Load a Jupyter Notebook (1:51)
5. spaCy (2:14)
6. Python Libraries (0:39)
7. About the Data (3:52)
8. NLP Terms (5:03)
9. Preprocessing Text Data (12:09)
10. Term Frequency (9:47)
11. Named Entity Recognition (8:29)
12. Part of Speech Tagging and Dependency Parsing (6:52)

264

Students

Project

About This Class

If you’ve ever wanted to learn how to analyze text data with Python, this course is for you!

Leah is a Data Scientist at a large financial institution and discovered there is a serious gap between the skills and techniques students learn in school versus what they actually need on the job in the real world. She'll use her expertise to teach you the foundations of natural language processing (NLP).

This course is intended for aspiring data scientists and programmers looking to expand their knowledge of NLP.

In this course you’ll learn:

- NLP terminology used in the industry
- Text preprocessing techniques
- Named entity recognition
- Term frequency
- Dependency parsing
- Part-of-speech tagging

You’ll gain hands-on experience with each concept by analyzing 500 Amazon Home and Kitchen product reviews.

Throughout the course, you’ll walk through code examples in Python using a Jupyter Notebook. You’ll also utilize popular libraries such as pandas, spaCy, and scikit-learn. No prior knowledge of NLP is needed for this course; however, a working knowledge of basic programming concepts (functions, for loops, etc.) and intermediate Python skills are recommended.

Music by TimMoor from Pixabay

Meet Your Teacher

Data Science Rebalanced

Data Scientists

Teacher

Leah Berg and Ray McLendon are Data Scientists at a large financial institution and have over 15 years of combined experience. They have a passion for seeing people grow and become the best versions of themselves. When Leah and Ray graduated from university, they struggled at their first Data Scientist jobs and quickly realized that academia only told half the story.

While their degree programs placed a large emphasis on machine learning algorithms with perfectly cleaned and balanced data sets, they found the opposite true in the industry. Every problem they encountered required 90% of their time spent focusing on messy and imbalanced data sets, as well as the people generating those data sets.

Leah and Ray created Data Science Rebalanced to help data scientists new to the...

Related Skills

Development Programming Languages Python

Level: Beginner

Hands-on Class Project

Now that you have analyzed 500 Amazon Home and Kitchen product reviews, your task is to analyze the review summaries (i.e. the summary field in the data set) and see how they compare to the product reviews. You'll need to preprocess the summaries (hint: the steps may be different than what was used in the course), gather term frequencies, apply named entity recognition, and perform part-of-speech tagging / dependency parsing.

Share your final code with the class by uploading to the "Your Project" section.

Transcripts

1. Class Trailer: Hi everyone, welcome to today's course, which is on natural language processing in Python. My name is Leah and I am a data scientist at a large financial institution with about four years of experience. My coworker Ray and I really wanted to make these videos because we noticed there's a huge gap between the skills that you learn in school versus the ones that you actually need in the real world. So all of our videos are going to be focused on using real-world datasets and real-world problems, and also giving you the skills you need to solve those that they don't necessarily teach in school. So we're really excited to have you today and hope you'll stick around. In today's lesson, we're going to be covering the basics of natural language processing, including some preprocessing techniques, part-of-speech tagging, named entity recognition, dependency parsing, and term frequency. You may not know what any of that means yet, but you will be a pro by the end of this video. This tutorial is really meant to be beginner friendly. We're going to start from the beginning with Jupyter notebooks. You'll be able to follow along with what I'm doing and use some really popular libraries such as pandas, spaCy, and scikit-learn. Now these are really popular in the data science community in general, but especially for natural language processing. The data that we're going to use for today's course is a set of several thousand Amazon reviews. I chose Amazon reviews because pretty much everyone has bought something from Amazon at one point and most likely left a review. So this is a really familiar dataset to everyone. Specifically, we'll be taking a look at the home and kitchen category. But I'll also link you out to the dataset where I pulled this from, and you'll be able to see a bunch of different Amazon review categories such as tech, baby, makeup, anything like that. And with that, let's get started.

2. What is NLP?: So in today's course, we're going to be talking about natural language processing, or NLP, with Python. A little bit of background on what NLP actually is: at a high level, it's really just a way for computers to understand or process human language. If you take other courses on NLP, you might hear the term natural language understanding. Some courses make the distinction between natural language processing as actually breaking down the text data into a form that computers can understand, and natural language understanding as recognizing the relationships between words within a sentence, relationships between sentences, or even relationships between whole documents. So a couple of examples of NLP that you're probably familiar with in your own life. Number one, your Amazon Alexa. A lot of people have these devices in their homes and you're able to ask a question like, "Alexa, what's the weather?" And she will give you back an answer saying, "The weather is currently 65 degrees in Tulsa." So behind the scenes, really what Alexa is doing is using some natural language processing or natural language understanding to take the speech that you actually said out loud, convert that into written text, and then convert that into something that a computer can understand. It processes that, gets the answer, and then sends it back to you. Now, Amazon has put a lot of time and energy and research into this. They really make everything seem super simple. You ask a quick question, you get a very fast response, but there's actually a lot going on underneath the hood. So another example of NLP that you might be familiar with in your daily life is predictive text on the iPhone. Now if you're not familiar with this, this is where you'll start typing a sentence in your text message.
And then Apple will try to predict what you want to say next to make it easier, so you don't have to actually type out all of your words. You can just click the buttons for what you want to say. So in the example here we've typed out "Want to grab lunch?" We're going to send that to Jane. And then Apple is giving us some little emojis that we might want to put with it, and also some other words that we might want to include there as well.

3. Course Overview and Tools: In today's course, like I mentioned earlier, we're really starting at a very beginner-friendly level. So if you don't know very much about NLP, that's great. We're going to learn all of that in this course. And also, if you're not really familiar with the libraries spaCy, scikit-learn, and pandas, we're going to go through those as well. Now I do have listed here that you might want to have some experience with Python 3. This really isn't extremely necessary. We're going to be using Jupyter Notebooks, so you should be able to follow along really easily with what I'm doing. But I will be using some more advanced or intermediate functions within pandas, such as apply statements with lambda functions. I'll also be talking through those as we go through everything. The topics that we're going to cover in today's course are preprocessing, term frequency, part-of-speech tagging, named entity recognition, and dependency parsing. Now these are just really the tip of the iceberg when it comes to natural language processing, because really for any of these, you could have a full-blown course about them, but we're going to give you the basics here. And then you should be able to apply your understanding to real-world problems that you see. The tools for the course that we'll be using are Jupyter Notebooks. Like I said, I'm going to be running this through PyCharm, as it's my preferred IDE, but if you are using Anaconda, that's perfectly fine.
Just as long as you're able to run a Jupyter notebook somehow, that's great. And like I've mentioned before, the libraries that we're going to be using are mainly pandas, scikit-learn, and spaCy. And again, if you don't know anything about those libraries, that's totally fine. You're going to learn today.

4. Load a Jupyter Notebook: All right, so now we're going to start walking through some code. You should have downloaded the Jupyter notebook from the lesson, as well as the dataset. The way that I'm going to open this up is through PyCharm. This is my preferred editor. I generally don't like using Anaconda just because it doesn't really play well when you're trying to create executable files and do more traditional software development. So I'm going to be using PyCharm, but if you have Anaconda or your favorite IDE that's not one of those two, load it up on your computer. Feel free to do that. But I'm going to just open up a Jupyter notebook by going to the terminal and then typing jupyter notebook. And what this is going to do is spin that up. So you should be able to do this in whichever IDE you're using, but I just prefer to use PyCharm. Okay, so we can go ahead and get started now that we have our Jupyter notebook open. I've included a lot of text here, so this could really be a standalone Jupyter notebook with this background information, but you'll be getting a lot of this through the slides that I'm talking through in other portions of this video. So oftentimes I might skip over this, but if you ever need to go back and refer to anything, say you forgot what tokenization was or something, you can feel free to just look in the Jupyter notebook in these text sections. I did also include a link to download PyCharm, if you want to do that. I've just downloaded the Community Edition, which is free. I'll go out to that link just to show you guys what that looks like.
This is my preferred development environment, like I've said, but I know a lot of people use either Anaconda, because it comes with Python and all of that good stuff, or VS Code, another popular one. So pick whichever one works for you. PyCharm Community Edition is free, so you just download that and install it on your computer if you'd like to play around with that.

5. spaCy: Now, if you are completely new to natural language processing in Python, you might not have heard of this library called spaCy. What spaCy is, is just a free, open-source library that is really popular in the NLP space. A lot of people use it for its pre-trained pipelines, which are stored as models. So the way that spaCy stores things is a little bit complicated. It takes a little bit of getting used to, but we're going to break it down. An overview here is that we're going to take some text and put it into this NLP engine, which does a lot of things behind the scenes. It's going to tokenize our words for us. It's going to do some part-of-speech tagging, some parsing, and named entity recognition. And really you can also add in your own custom steps if you want, but this is just an example here. And once all that is done, it basically saves all of that off as what's called a Doc. And then from there, inside of the Doc, you can access all of these things. It'll make a lot more sense when we go through the tutorial, but this is just a high-level overview for now. So like I mentioned in the slides, we're going to be using a library called spaCy, which is extremely popular for natural language processing in Python. And one of the things you have to do here is first install a model that spaCy is going to use. They have small, medium, and large versions of models. You can also train your own model if you want. So for this demo, we're going to download the small English model that was trained on text from blogs, news, and comments.
I'm going to first go out and show you what these models look like in the spaCy documentation. So you can see right here we're using the English core web small model, and it gives you a little more information about it. But you can also see there's a medium one out there, a large one, and they have some other versions as well. There are also models for a lot of different languages. So if you're working with data that's in a language other than English, you can go out here and download, say, a Spanish model if you want to do that. So in this first step here, we're just downloading the model. I've already downloaded this, so it's probably going to give me a message saying that I've already downloaded it. But for you, you'll probably have to sit through the download, which shouldn't take too long. Okay, so it's saying that I already have it downloaded. That's totally fine. Just make sure you get yours downloaded.

6. Python Libraries: Here we have all of the imports that we're going to be using for this dataset and analysis. So we're going to be using pandas, spaCy, scikit-learn, a visualization tool called Yellowbrick, and also the pathlib library. We also go ahead and load in the English model from spaCy and just label that as nlp, because we'll be using that later. So I am going to import all of those. If you're running this for the first time, you'll probably need to install all these libraries as well, so just make sure you do that. I have also included a requirements file. So if you are using PyCharm, you can make your own virtual environment and do just a pip install from the requirements file, and it will download everything you need from there.

7. About the Data: Next up, I wanted to give a little bit of background about this data that we're using. Huge shout out to Julian McAuley at the University of California, San Diego, who actually put this data together. He provides a ton of different Amazon reviews from back in 1996 all the way through 2014.
And for our purposes today, what we're going to do is just analyze a subset of 500 Amazon home and kitchen product reviews. So in addition to the reviews, which include the ratings, text, and helpfulness votes, Julian also gives us some product metadata, such as descriptions, category information, price, brand, image features, and also links. For today's course, we're going to only look at the review data. So this will just be the ratings, the text of the reviews, as well as any helpfulness votes, so how many people found this product review helpful. I really like this dataset because most of us have probably purchased something from Amazon at some point in our lives, and potentially have written a review. So this is really familiar to a lot of us. I also really like this data because it is an awesome example of how humans really type or communicate through text. And that's going to include a lot of typos, run-on sentences, all caps to indicate emotion, a lot of exclamation points, and pretty much anything under the sun you can think of. Whereas if we were dealing with any sort of formal text, we wouldn't have as many of these errors, or as much of the emotion that people generally use in the way they write product reviews. So I'll go out to this data and show you guys what is actually out there. You can feel free to go out to this website and take a look. He includes a ton of different product review categories. We're going to just be focusing on the home and kitchen section. But if you wanted to look at reviews for books, CDs, sports, you name it, there's some data out there for it. And he gives a lot of information here about what these datasets actually mean, so feel free to go and read those in your own time. Julian was also really nice to provide us with a couple of functions to be able to parse the data, because it does come in JSON format. And we ultimately want to get it into a pandas DataFrame to be able to manipulate it a little bit easier.
So he's already written these two functions for us: one to parse the file path and then one to actually create a DataFrame from the JSON file. So we're going to go ahead and use those functions. And then here we're just reading in the DataFrame and then taking a sample of 500 items. This random state option, you can set that to whatever you want. But if you want to get the same dataset that I am working with, just leave this at one. This is helpful for reproducibility. And so go ahead and run that. It will take a little bit, because there are actually about 0.5 million reviews. They're in JSON format, and we have to parse through that JSON to get the pandas DataFrame that we want. So for the sake of this tutorial and not having things run super long, we're going to just take a subset of 500 of those. And so we'll let that run for a minute and then come back in a second. So it looks like we are loaded up, and let's take a look at what the DataFrame actually looks like. So you can see here, pandas gives us this unique ID, then we have this reviewer ID, another ID that looks like it's maybe for the product, the reviewer name, a piece telling us how many people found this helpful versus didn't find it helpful, the text of the review, which will be the main thing we focus on in today's lesson, the overall score out of five, the summary of the review, and then some columns about when the review was made. So you can see here we have a few from 2013 and a lot from 2011. But this data does go all the way back to 1996.

8. NLP Terms: So when you start off in the NLP space, you're probably going to start hearing a lot of terms that you haven't heard before. I know, especially when I started, it was a little bit overwhelming hearing the different vocabulary that's used for this. So in this course we're going to start by defining all of those terms and then show how they are used.
First off, we're going to start with token, and you can think of this as a grouping of characters. So in this example, we are tokenizing at the character level. This means we're starting out with the sentence "She was offered the job 11 months ago" and taking each individual character and making that a token. Now this is really helpful when you have short texts such as filenames or, in the example of the data that we're using, the title of a review. In those cases there might not be enough text there to be able to tokenize at the word level very well. Especially with filenames, oftentimes people will put underscores or other kinds of characters in there, so we can't actually separate on whitespace. So oftentimes it can be useful to break it down at the character level and then send that on to other processing steps. You can also do tokens at the word level, or unigrams. So in this example, we're taking that same sentence and then splitting it a different way. We're actually splitting it up into what's known as unigrams, which you can basically think of as words. So here our tokens are "she", "was", "offered", and so on. Now you can also group tokens in groups of two as well. So instead of breaking things down by words, we can put two words together. This takes our sentence, "She was offered the job 11 months ago," and breaks it down into two-word tokens. Bigrams are also very interesting because they give us positional information. In this case, for our first bigram, "she was", we can tell that "she" is the first word and "was" is the second word. That would be different from, for example, a bigram "was she", where "was" would be the first token and "she" would be the second token. Now this in the industry is actually known as n-grams, and you can do as many groupings as you want. So bigrams would be groups of two tokens together.
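The n-gram idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the course notebook: the helper name `ngrams` is made up here, and it uses the common sliding-window definition in which neighboring n-grams overlap, which may differ from how the slide grouped the tokens.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in order (sliding window)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "she was offered the job 11 months ago"

# Word-level tokens: split on whitespace.
words = sentence.split()
unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)

# Character-level tokens work the same way: a string is a sequence of characters.
char_bigrams = ngrams(list("she"), 2)

print(bigrams[:2])   # the first two word bigrams
print(char_bigrams)  # character bigrams of "she"
```

Because each bigram keeps its tokens in order, `("she", "was")` and `("was", "she")` are different tokens, which is the positional information mentioned above.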
You can do trigrams for groups of three tokens together, and so on and so forth. You can also do these kinds of tokens at the character level as well. So we could group "sh" and "ew" as tokens. And when we get into the code here, we're going to see some examples of why we might want to use unigrams versus bigrams. But tokens are really the foundation of natural language processing. So once we have all of our text in tokens, what we can do is group the tokens together as documents. You're going to hear "document" a lot. Normally you might think of that as a Word doc or a piece of paper or something. But really, when we're talking in natural language processing, a document is just a group of tokens. So an example of that is right here. We have document one: "She was offered the job 11 months ago." That's just a single sentence, but we call it a document here. Compare that to document two: "The two girls went to the park after school. They saw three squirrels and chipmunks." Notice that that's two sentences, but altogether that gets grouped as a document. Now you could do this with more than just sentences. You can, for example, group paragraphs together. You can group pages or entire documents as well. From there, we go up another step. When we group documents together, that's known as a corpus. So you can think of a corpus as your entire group of all the documents that you have. As an example of that, we take the two documents that we just saw in our last example and say that we group those together, and that's going to be our corpus. So in today's course, we're going to be working with Amazon reviews. All of the Amazon reviews are going to be our corpus of documents. So the next term that you're going to hear quite often in natural language processing is vocabulary. And vocabulary is really just the unique tokens that are in your corpus.
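To make the document, corpus, and vocabulary terms concrete, here is a small plain-Python sketch. It assumes the two example documents are punctuated as shown and uses a simple whitespace tokenizer (the `tokenize` helper is made up for illustration); the course itself relies on spaCy for tokenization.

```python
import string

# A corpus is a group of documents; each document is a group of tokens.
corpus = [
    "She was offered the job 11 months ago.",
    "The two girls went to the park after school. "
    "They saw three squirrels and chipmunks.",
]


def tokenize(document):
    """Strip punctuation, then split on whitespace into word-level tokens."""
    cleaned = document.translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


tokenized_corpus = [tokenize(document) for document in corpus]

# The vocabulary is the set of unique tokens across the whole corpus.
vocabulary = sorted({token for doc in tokenized_corpus for token in doc})

# Lowercasing first merges tokens that differ only by case.
lower_vocabulary = sorted(
    {token.lower() for doc in tokenized_corpus for token in doc}
)

print(vocabulary)
print(len(vocabulary), len(lower_vocabulary))
```

Without lowercasing, tokens that differ only in capitalization count as separate vocabulary entries, which is one reason lowercasing shows up as a preprocessing step.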
So again, for an example of a vocabulary, we're going to pull from the corpus example that we just looked at. All we're going to do is get the unique tokens between the two documents. I've listed all of those out below, and you'll notice, for example, that "the" shows up in both documents, but it's listed as two individual tokens within the vocabulary here, one with a capital and one with lowercase. This will be important when we talk later about some of our preprocessing techniques, where we would potentially lowercase the sentence, and then we would only have one instance of that.

9. Preprocessing Text Data: Let's talk about preprocessing steps. There are a wide variety of preprocessing steps that you can do to your data. And depending on the situation, you might not use all of the preprocessing steps that I'm going to talk about, or even do them in the same order. It totally depends on your use case and your data. One of the most basic preprocessing steps that you can do is lowercase your data. This really ensures that case won't affect any term frequencies or word counts that you're doing. So let's take an example sentence: "She was offered the job 11 months ago." This is written the way we'd probably write it, with a capital letter at the beginning of the sentence and a period at the end. What we would do to this sentence as our first preprocessing step, potentially, would be just to lowercase everything. So all we're changing here is making that capital S a lowercase s. That way, when we're identifying tokens, we're not counting "She" with a capital S and "she" with a lowercase s as two separate tokens. We want those to be recognized as the same token. Now you might not want to lowercase things if you are potentially working with data where it's real people writing and maybe they're writing in all caps to express a certain sentiment. In that case, you might not want to lowercase everything.
You might want to extract the words or tokens that are in all caps, for example. The next preprocessing step I'm going to cover is removing punctuation. This is exactly what it sounds like. We're going to be removing characters like periods, exclamation points, question marks, really any sort of punctuation. The reason we want to do this is just to clean up our data a bit so that we are not counting punctuation as tokens. Really, we probably don't care about those. We really care about the words themselves within the text, and so oftentimes we'll get rid of punctuation in our preprocessing. However, you might not want to remove punctuation if you're trying to separate sentences within a document. You might want to leave it in so that you have a way of telling that this is one sentence, this is a second sentence, and so on and so forth. So, an example of removing punctuation. Here we have our same sentence with a period at the end. To remove the punctuation, all we're going to do is get rid of that. And you'll see in the next sentence that we have no punctuation. Another preprocessing step you might do is removing any numeric characters between 0 and 9. Oftentimes when you're dealing with text data, you want to get rid of numbers because they don't provide a lot of value. Oftentimes we're interested in the words themselves and not any numbers. But there may be situations where you want to keep numbers, if you're dealing with text that has a lot of dates. You might want to leave those in to be able to identify that this text was from this date. Here's an example of taking out numbers. We're going to just remove 11 from the sentence that we're working with here. The next text preprocessing step is all about removing stop words. And if you haven't worked in natural language processing before, you probably haven't heard of what a stop word is.
Really what it comes down to is just unimportant words like "and" or "for", anything that's short and doesn't really add a lot of value to a sentence, but also happens a lot. Think about how much the word "the" is used in human language. Really, if we're going to see what the most popular words in a sentence are, or try to figure out what kind of sentiment someone is expressing, "the" probably isn't going to add a lot of value. So oftentimes we remove those. You might be wondering, how do you come up with a list of stop words, or what is the list of stop words? Different libraries use different lists. So you might try one library that uses a certain list, and you might try another one that uses another list and includes some words that the other didn't. You also can create your own list of stop words. For example, if you have a dataset that is highly specific to, say, financial data, and you don't want to include words such as "budget" or "bank", things like that which probably show up in a lot of financial text data, you can include a dictionary of all of those words and then tell Python to remove those as well and count those as your stop words. For this example, the stop words that I'm using are just going to be the typical ones that a lot of libraries use. So we're going to end up removing "she", "was", and "the". And so the sentence we're left with is "offered job months ago". The next preprocessing step I'm going to go over is what's called tokenization. At the beginning of this course, we talked about tokens and how you can get tokens that are words. You can do tokens that are characters. You can really create whatever kind of tokens you want. But just think of tokenizing a document as separating the text into smaller units, or tokens, like we've talked about earlier. Now in this example, we're taking our string or sentence, "offered job months ago", and then we are just splitting that on whitespace. Here, I'm splitting it so that it's words.
So we end up with "offered", "job", "months", and "ago", all as smaller units of the original string. In natural language processing, you're probably going to start hearing about stemming and lemmatization. These are ways of taking words and getting them down to their root form. These techniques are pretty similar, with a slight nuance. Stemming actually just cuts off the last few characters of a word to get to the root word, while lemmatization uses part-of-speech tagging to be able to convert a word into its root form. Now there are different use cases for why you might use stemming or lemmatization. I tend to prefer using lemmatization because it converts the words into a more readable format. With stemming, you might have a word with just the last few letters cut off, and you don't actually know what that word was. Even though I prefer lemmatization, it is worth noting that stemming is actually a lot faster, because it's just chopping off the last few characters in the token, whereas lemmatization needs to go through a round of part-of-speech tagging, and potentially some dependency parsing, to be able to get the correct lemma for the token. So an example of stemming here is taking the word "trouble" and just taking off the last letter, e, to get it down to its root form. Now you might think of words like "troubling" or "troubled". Those would all get transformed into "troubl". Now lemmatization takes a word based on its part of speech and gets it into its root form, so it's a little bit more readable. Here you can see that "offered" got transformed into "offer", the present form of it. "Job" stayed the same. For "months", we took off the s to get to its root form, "month". And "ago" doesn't change. So for the final result, if you remember where we started with this sentence, "She was offered the job 11 months ago," we actually get to "offer job month ago" as our final set of tokens.
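The whole walk-through above can be reproduced with a short plain-Python sketch. Everything named here is an assumption made for illustration: the three-word stop list and the two-entry lemma lookup only cover this one sentence, whereas real libraries ship much longer stop lists and lemmatize using part-of-speech information. The sketch also tokenizes before removing stop words, a slightly different order than the slides.

```python
import string

# Tiny illustrative stop-word list (hypothetical; just enough for this sentence).
STOP_WORDS = {"she", "was", "the"}

# Hand-written lemma lookup (hypothetical stand-in for a real lemmatizer).
LEMMAS = {"offered": "offer", "months": "month"}


def preprocess(sentence):
    """Apply the steps from the lesson and return the cleaned tokens."""
    text = sentence.lower()                                           # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = "".join(ch for ch in text if not ch.isdigit())             # numbers
    tokens = text.split()                                             # tokenize
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]         # stop words
    return [LEMMAS.get(tok, tok) for tok in tokens]                   # lemmatize


tokens = preprocess("She was offered the job 11 months ago.")
print(tokens)
```

Swapping the order of the steps, or skipping one (for example keeping numbers when dates matter), only requires editing the corresponding line.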
And this helps cut out a lot of the words we don't care about and really get down to the meaning of the sentence. We'll start off by preprocessing our data. You can absolutely write your own functions to do all of these preprocessing steps, such as lowercase, remove punctuation or numbers, remove stop words, tokenize, and lemmatize. But spaCy is actually really nice and does all of these based on token attributes. They have a lot of different attributes on their tokens. I'm going to go out to this link here, where in the documentation you can see all of the attributes that they have. So for each token, you can see if there's whitespace in it. You can see what type of entity it is. You can see the lowercase form of the token. You can see the shape of the token. They have all of these different attributes. So spaCy has a lot of different token attributes, but the three that we're going to be interested in for doing our preprocessing are token.lemma_, which gives us the lemma of the token, token.is_alpha, which lets us remove punctuation and numbers, and token.is_stop, which lets us remove any stop words. In this text here, I'm just taking the same example that I went over in the slides. So "She was offered the job 11 months ago," applying the natural language processing to it to save it off as a document, and then saying, let's return back the lemma for each token in the document, as long as it's not a numeric character and is also not a stop word. And this syntax, if you're not familiar, is called a list comprehension, which brings us back a list of the clean tokens. So we'll run that and then print out the document. As you can see here, when we print out our spaCy Doc, it looks like nothing happened. It's basically storing all this information behind the scenes. If we want to actually see the clean text, we'll print that out.
You can see here that what we're left with is just like we had in our slides example. We're only left with the following tokens: "offer", "job", "month", "ago". This is just a toy example to see how this works, but what we're actually going to want to do is tokenize and clean up all of the reviews that we have.
To do this, I wrote a function called preprocess_text. It takes in a spaCy Doc and returns a string, and I give a little bit of information about what the function is doing. This is a really good practice that I would highly suggest you get in the habit of: writing what are called docstrings, to make your code more readable and allow others to pick it up more easily. What we're doing here is preprocessing a spaCy Doc by lemmatizing it, removing any stop words, and then removing non-alphabetical characters. Again, we're taking in a spaCy Doc, which is a sequence of token objects, and returning the cleaned text. The actual code that we're running is the same code that I had up above, except that instead of returning a list, I'm joining all of the tokens back together into a single string. This makes it a little bit easier to read.
So first we will apply the NLP model to the whole review text column. I'm doing this via the pandas apply method with a lambda function. If you're not familiar with what a lambda function is, it's just an inline function, where x is our variable here. So I'm applying that to all of the data within the review text, and I save that off as a new column called spacy_doc. The reason we want to save this off is that it takes a lot of compute resources to run spaCy, since it's doing so many things behind the scenes, and we're going to be referencing this over and over again. We don't want to calculate it multiple times, so we save it off into a new column called spacy_doc.
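A function along the lines described above, with a docstring, might look like the sketch below. The name `preprocess_text` and the column names in the comments follow the spoken names in the lesson, so their exact spelling is an assumption. The `make_token` helper is invented here so the example runs without spaCy; the function itself only relies on the `lemma_`, `is_alpha`, and `is_stop` attributes.

```python
from types import SimpleNamespace
from typing import Iterable


def preprocess_text(doc: Iterable) -> str:
    """Preprocess a spaCy-style doc.

    Lemmatizes each token, drops stop words, and drops tokens that are not
    alphabetic (punctuation, numbers).

    Args:
        doc: a sequence of token objects exposing lemma_, is_alpha, is_stop.

    Returns:
        The cleaned lemmas joined into one space-separated string.
    """
    return " ".join(
        token.lemma_ for token in doc if token.is_alpha and not token.is_stop
    )


def make_token(lemma: str, is_alpha: bool = True, is_stop: bool = False):
    """Hypothetical helper: build a stand-in token with hand-set attributes."""
    return SimpleNamespace(lemma_=lemma, is_alpha=is_alpha, is_stop=is_stop)


# Made-up stand-in for "I opened the box."
doc = [
    make_token("I", is_stop=True),
    make_token("open"),
    make_token("the", is_stop=True),
    make_token("box"),
    make_token(".", is_alpha=False),
]

print(preprocess_text(doc))  # open box

# With pandas and a loaded spaCy model `nlp`, the two-step flow described in
# the lesson would look roughly like this (column names assumed):
#   df["spacy_doc"] = df["review_text"].apply(lambda x: nlp(x))
#   df["review_text_clean"] = df["spacy_doc"].apply(preprocess_text)
```

Keeping the parsed docs in their own column means the expensive model call happens once per review, and every later step reuses the result.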
From there, we're going to work on the spacy_doc column and apply the function that we just wrote up here, preprocess_text. We'll save that off in a new review_text_clean column. I'm going to run that, and it's going to apply to those columns, make our new column, and then we will print out the results.
So let's print out what spacy_doc looks like. This is just the first five rows of our DataFrame. What this has done is tokenized, done some named entity recognition behind the scenes, dependency parsing, a lot of things, and it's just showing us the tokens separated by commas. If we print out the cleaned review text, our original text for this one was "I opened the box and to my surprise..." Based on all of our preprocessing steps, what we're left with is "open box surprise scoop handle send asap". So you can see that we're cutting out a lot of the words that we probably don't care about, which happen a lot in the English language, and really getting down to the words that make a difference or add a lot of value to the sentence.
So that's how easy spaCy makes it to preprocess. But like I said, depending on your use case and what data you're working with, you might not use all of these preprocessing steps, or you might do them in a different order. It's very important to think about the data that you have and what kinds of steps you would want to apply to clean it up.
10. Term Frequency: A formal definition of term frequency is the number of times that a token appears in the corpus. What that often translates to is doing a word count, and this can be a really great way to summarize data. For example, in our data we have hundreds of thousands of reviews. We don't have time to sit and read through every single one of those to get an idea of what people are talking about.
We could use term frequency to see what the most popular words or terms people are talking about are. That could give us a better understanding of the data in a very short amount of time, instead of spending hours, potentially days, reading through each and every individual review.
A really popular way to do term frequency is through scikit-learn's CountVectorizer. CountVectorizer is really nice because it does all the work for you. You don't have to go through all of your text, loop through each token, and store a count for each of those. It will just do it all in one go for you. There are several different parameters within CountVectorizer that you can tweak, and I'd encourage you to go out to the scikit-learn documentation and take a look at those for yourself, because they do impact how the words are counted. But the two that we are going to be focusing on today are stop_words and ngram_range.
For CountVectorizer's stop_words, you have a few different options. You can not use any stop words: maybe you've already preprocessed your data and taken out stop words, or maybe you actually want to include stop words in your counts. You can use the built-in English stop words. Or you can pass in your own custom list. This might be useful where you have data related to a certain industry, and words pop up that aren't necessarily common across the entire English language but are very specific to your dataset, and you don't care about them. You could pass in a list of those to remove as well.
ngram_range is a really powerful parameter of CountVectorizer. You can pass in whether you want to see just unigrams, just bigrams, unigrams and bigrams, trigrams, really any range of n that you want. This is really nice because it allows us not only to view which words are really popular, but also to start picking up on what phrases people are using.
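To show what counting over an n-gram range means, here is a small standard-library sketch. It is not CountVectorizer: the `ngram_counts` function and the three sample strings are made up for this example, and it simply splits on whitespace, whereas CountVectorizer applies its own lowercasing and token pattern by default. The `ngram_range` argument is modeled on the (min_n, max_n) tuple that CountVectorizer accepts.

```python
from collections import Counter


def ngram_counts(texts, ngram_range=(1, 1)):
    """Count n-grams over whitespace-split texts, for every n in the inclusive range."""
    min_n, max_n = ngram_range
    counts = Counter()
    for text in texts:
        tokens = text.split()
        for n in range(min_n, max_n + 1):
            # Slide a window of n tokens across the text.
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Made-up, already-preprocessed review snippets.
reviews = [
    "easy clean stainless steel",
    "stainless steel pan work great",
    "work great easy clean",
]

unigrams = ngram_counts(reviews, ngram_range=(1, 1))
bigrams = ngram_counts(reviews, ngram_range=(2, 2))
both = ngram_counts(reviews, ngram_range=(1, 2))

print(unigrams["pan"])             # 1
print(bigrams["stainless steel"])  # 2
print(len(both))                   # 14 distinct unigrams and bigrams
```

Because every bigram contains two unigrams, single words tend to have higher counts than the phrases built from them, which matters when unigrams and bigrams are ranked together.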
So next, let's get into some term frequency, or you can think of that as a word count. Like I said earlier, we have in total in this dataset about half a million reviews, and we've only sampled about 500 of those. That definitely does make it more human-friendly to read through them. You could potentially go through all 500 manually if you wanted to. But a really great thing we can do here is start to get some plots of what kinds of words or phrases people are talking about in these reviews, to summarize them more quickly instead of manually reading through all of the reviews people have left.
To do this, we're going to be using CountVectorizer from scikit-learn. CountVectorizer has a wide variety of parameters. I'm going to go to this link and show you what that looks like. This is the documentation from scikit-learn on CountVectorizer, and you can see these are all of the inputs that you can give it. You can tell it whether you want it to lowercase things and how you want it to tokenize, whether that's at the word or character level. You can also include stop words and a token pattern. There are a lot of different options out here, so feel free to look through them, play around with some of them, and see how they affect the data. But today we're going to focus just on the n-grams and the stop words.
The way that we use CountVectorizer from scikit-learn is we call CountVectorizer. I'm going to leave all the defaults, except I'm going to say, let's use the English stop words, and then let's start out with an ngram_range of (1, 1). This means I'm only going to see unigrams, or single words. We save that off as this variable, vectorizer. From there, on vectorizer we call the fit_transform function, pass in our review_text_clean, and save that off as docs. Then we can call vectorizer.get_feature_names, and that gives us the features.
That is basically getting our word counts for us behind the scenes. Now if we want to actually plot those, we can use a library called Yellowbrick (scikit-yb) and visualize them really easily. From Yellowbrick, we're going to be calling the frequency distribution visualizer, FreqDistVisualizer. We're going to pass in the features that we just created, as well as the size of the plot. Then we do a fit on that, and finally a show. So let's see what that looks like.
Here we can see a plot of the frequency distribution of the top 50 tokens from the 500 reviews that we pulled. The top word is "use", which probably makes sense: if you have a product, you're going to be using it. We've got some positive words like "like", "good", and "great", so it's looking pretty positive so far. And we've got some words like "coffee", "pan", and "cup" down here. Just like that, we were able to summarize the top words that people are using across 500 Amazon reviews.
Now say these single words aren't really enough information for us. What we can do is go back to our vectorizer and change the ngram_range from just unigrams to just bigrams, or two words together. If we rerun this cell and then redo our plot, we'll see which two words people are using together a lot. You can see we've got "easy clean", "stainless steel", "work great", "easy use", and "non stick". You can go through here and look at them, and a lot of them make sense because of the realm of Amazon products that we're in: home and kitchen. So "cast iron", "stainless steel", all of that stuff makes sense. "Dishwasher safe", "cutting board". It's really interesting to see, and you can start to think about what kinds of products people are potentially reviewing from these sets of bigrams.
One more thing we can do: you don't have to do just unigrams or just bigrams. We can even do unigrams and bigrams together. So we rerun that cell and rerun the plot.
We will see the unigrams and bigrams together. But it looks like, in this case, the frequency of the unigrams is actually higher than any of the bigrams, so that's why we're only seeing unigrams on this plot. It's definitely interesting to see, and you can also do trigrams. Let's see what three-word phrases people are using. We've got "sensor soap dispenser" and "non stick coating". This one would probably be "works like a charm," but the preprocessing that we did changed it to "work like charm". You can look through here and see what people are talking about and potentially what kinds of reviews people are leaving as well.
In addition to what kinds of words or phrases people are talking about, we also might want to figure out how long the average review is. To do that, we can take our spacy_doc and use a lambda function to get the length of it, save that as the count of all tokens, and then look at our review_text_clean to see how many tokens we're actually dropping from the original text versus the clean text. If we run that, we will get two new columns: token_count_all, which is the count of all of the tokens, and token_count_clean, which is the token count of the preprocessed text data.
The next thing I'm going to do is plot that by getting the value counts and then doing a bar chart of the token counts. You can see we've got a pretty wide variety of token lengths. To make this chart even more clear, you could probably group things as a histogram, but I'm just doing the exact counts. You can see most of our reviews are under 100 tokens. Then we have some outliers here: this one, and this large one out here, which has 739 tokens. They must have been really jazzed or really upset about whatever product they had.
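The token counting and the histogram-style grouping mentioned above can be sketched without any libraries. The token lists below are made up and stand in for the parsed and cleaned review columns; `bin_counts` is a hypothetical helper that groups exact counts into fixed-width buckets.

```python
from collections import Counter

# Made-up (all tokens, cleaned tokens) pairs standing in for two columns.
reviews = [
    (["I", "opened", "the", "box", "."], ["open", "box"]),
    (["Works", "great", "!"], ["work", "great"]),
    (
        ["It", "is", "a", "very", "sturdy", "pan", ",", "easy", "to",
         "clean", ",", "love", "it"],
        ["sturdy", "pan", "easy", "clean", "love"],
    ),
]

# One count per review, before and after preprocessing.
token_count_all = [len(all_tokens) for all_tokens, _ in reviews]
token_count_clean = [len(clean_tokens) for _, clean_tokens in reviews]


def bin_counts(counts, width=10):
    """Group counts into buckets of the given width, keyed by the bucket's lower edge."""
    return Counter((count // width) * width for count in counts)


print(token_count_all)              # [5, 3, 13]
print(token_count_clean)            # [2, 2, 5]
print(bin_counts(token_count_all))  # Counter({0: 2, 10: 1})
```

Bucketing trades exact counts for a shape that is easier to read when a few very long reviews stretch the axis.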
There are definitely a few outliers out here with longer reviews, but I would say they tend to be, it looks like, around the 30 to 40 token range, which is actually pretty short for reviews. But remember, this is the count of all tokens: punctuation, stop words, and so on. Let's see what happens when we do the token count on the clean column, the preprocessed data. You can see that the count for our outlier drastically dropped, from 700-some tokens to 271. You can see the average number of tokens in a review is actually around 10 to 15 or so, which really isn't as many as I would think for an Amazon review.
Usually when you are looking into your data, you'll want to take a look at any outliers and see what they look like. So I'm going to take a look at this review that has 271 preprocessed tokens, and I'm going to print out the actual original review text. Note that since I'm printing out the original review text, this is where there are the 700 or so tokens. You can see this is someone who really had a lot to say about their burr grinder. It looks like they came back several times and updated their review, so this wasn't just a single review; it was written over time, which is interesting. You can play around and check out which ones you think are interesting. Maybe you want to look at the ones where there were only 10 tokens at the end, and print those out as well.
11. Named Entity Recognition: Named entity recognition is an extremely powerful tool within natural language processing that allows a computer to identify real-world objects within text. spaCy has its own list of things that it recognizes: anything from specific names of people, nationalities, and countries to a lot of other options. I'll link you out to their website so you can look through that as well. The way spaCy recognizes entities is through a pre-trained model.
That means someone has gone through and annotated, or labeled, a bunch of data, where they take sentences or paragraphs of text, and for each token within that text, they might have "Jane Doe" labeled as a person or "Japan" labeled as a country. When using a pre-trained model, it is important to note that it's not going to be 100 percent accurate. If your text is wildly different from the text that it was trained on, it's probably going to recognize a bunch of garbage, and we'll see a little bit of that in the Amazon reviews example that we're working with. Overall, it usually does a pretty good job on generic text. One real-world example of when you might use named entity recognition is to mask any sensitive data, such as names, social security numbers, or phone numbers.
So next up we will go through an example of named entity recognition. spaCy makes this super easy, which is awesome, and already has pre-trained models, so we can pretty much just apply this straight to our data. Just to give an example before we apply this to the entire dataset, I'm going to take this random index of a review, save that off, and also print it out, so we'll see what that looks like. Then what we're going to do is print all of the entities that spaCy recognized from it, as well as what character each one started at in the text and what character it ended at.
We can see that this review is talking about some skillets; you can read through that if you want to. The things that spaCy recognized out of here were numeric values, so 1 and 2; that's what CARDINAL means. It recognized Amazon as an organization, 400 degrees as a quantity, and the Silicone Grabbers as FAC. I'm not really sure what FAC means, so let's go out to spaCy's documentation and see all of the entities that it is able to recognize, and then we can figure out what FAC means. FAC, it looks like from spaCy's documentation, is a building, airport, highway, or bridge.
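The start and end character offsets are what make the masking use case mentioned above workable. Here is a minimal sketch under stated assumptions: the sentence and its entity spans are written by hand, and `mask_entities` is a hypothetical helper, not a spaCy function. With spaCy, the same tuples could be built from each entity's `start_char`, `end_char`, and `label_` attributes.

```python
def mask_entities(text, entities, labels_to_mask):
    """Replace selected entity spans with a [LABEL] placeholder.

    entities: iterable of (start_char, end_char, label) tuples that do not overlap.
    labels_to_mask: set of labels whose spans should be hidden.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        if label not in labels_to_mask:
            continue
        pieces.append(text[cursor:start])  # text before the entity, unchanged
        pieces.append(f"[{label}]")        # placeholder instead of the entity
        cursor = end
    pieces.append(text[cursor:])           # whatever follows the last masked span
    return "".join(pieces)


# Made-up sentence with hand-written spans (what a model might return).
text = "Jane Doe ordered 2 skillets from Amazon."
entities = [(0, 8, "PERSON"), (17, 18, "CARDINAL"), (33, 39, "ORG")]

print(mask_entities(text, entities, {"PERSON"}))
# [PERSON] ordered 2 skillets from Amazon.
print(mask_entities(text, entities, {"PERSON", "ORG"}))
# [PERSON] ordered 2 skillets from [ORG].
```

Since the model will mislabel some spans, a masking step like this can both miss sensitive text and hide harmless text, so it is worth spot-checking the output.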
Now if we go back to our data, we'll see that "Silicone Grabbers" definitely isn't a highway or bridge. It looks like it's the name of the product, or potentially the type of product. So it's interesting that spaCy recognized that, but that's what you'll find with spaCy's named entity recognition, as it was trained on a specific set of web data. Potentially because of the way the capitalization looks, it might be thinking that this is a building or something; apparently it has seen something in the data before that makes it think it's a building. Recognize that whenever you are doing named entity recognition, it is model-based, so you're never going to get 100 percent accuracy; there are going to be some mistakes. We'll notice that a little more when we start digging into the data and apply this across our whole dataset.
spaCy actually recognizes a lot of different types of entities. You can generally think of an entity as a proper noun: a person, place, or thing. But let's take a look at all of the things that spaCy recognizes. We can see they recognize PERSON, and they give you a little description of what that is: real people, and fictional people as well, which is interesting. We've got NORP, which is nationalities or religious or political groups; GPE, which is countries, cities, and states; products; documents made into laws. You can see there's a lot here.
spaCy also has a really cool visualization of the entities that it recognizes, and I'll show you what that looks like here. Basically, what it's doing is taking the text that we printed out earlier, the full text as well as the entities recognized, and making it a little easier for us to read. It's taking the review and then highlighting the entities that it recognizes in there. This might be helpful if you're trying to create some sort of website where you're doing named entity recognition and you want to display it easily to the user.
This is a lot prettier than the printout up here. Now, moving on from that one example, we want to apply this to the entire DataFrame. The way that I'm going to do that is to create a new DataFrame just for the entities, because there can be multiple entities recognized within a single review, and I want to separate those out from the other DataFrame that we're working with. I'm creating a DataFrame called df_entities, and I have the following columns. The index: that's going to match up if I ever want to merge my results back to the original DataFrame, so that's what the join would be on. We have the spacy_doc, which essentially is the review text. We have the entity text, the entity label, the entity start, and the entity end.
What I'm going to do is loop through the sampled DataFrame that we've been working with so far, using itertuples, which just means go through each row within the DataFrame. Then for each row, we're going to look at the review text and save off all of the entities that it recognizes. I'm going to run this code, and then we will check out the first five or so rows of the DataFrame. If you're working with a very large dataset, this is definitely going to take some time, because spaCy has to do a lot of things in the background.
So let's take a look at what this DataFrame actually looks like. These are the first five rows. We've got our text here, our spacy_doc, so this is the review text. This one's talking about purchasing a thermometer. Then we've got the text that it is recognizing and what it is recognizing it as. And if we needed to go back and do some analysis, it has the starting character in the text as well as where the entity ends. You can see in this specific example, at least for the first five rows, that we recognize "the weekend" as a DATE. We've got some numeric values as CARDINAL, and "the weekend" again as a DATE.
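The one-row-per-entity layout described above can be sketched with plain Python structures. Everything here is an assumption made for illustration: the `Entity` tuple is a stand-in for a spaCy entity span, the review indexes and offsets are invented, and the column names are underscored guesses at the spoken names. A list of dictionaries like `rows` can be passed to `pandas.DataFrame` to get a table.

```python
from collections import Counter, namedtuple

# Stand-in for a spaCy entity span, with the four attributes used in the lesson.
Entity = namedtuple("Entity", "text label_ start_char end_char")

# Made-up entities keyed by the review's index in the original DataFrame.
entities_by_review = {
    101: [Entity("the weekend", "DATE", 27, 38), Entity("2", "CARDINAL", 50, 51)],
    205: [],  # a review where nothing was recognized contributes no rows
    317: [Entity("Amazon", "ORG", 12, 18)],
}

rows = []
for index, ents in entities_by_review.items():
    for ent in ents:
        rows.append(
            {
                "index": index,  # join key back to the original DataFrame
                "entity_text": ent.text,
                "entity_label": ent.label_,
                "entity_start": ent.start_char,
                "entity_end": ent.end_char,
            }
        )

# A quick tally of how often each label shows up.
label_counts = Counter(row["entity_label"] for row in rows)

print(len(rows))            # 3
print(label_counts["DATE"])  # 1
```

Keeping the original index on every row is what allows the entity table to be merged back onto the reviews later, even though one review can produce many rows or none.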
So let's take a look and see what the most popular recognized entities are. It looks like CARDINAL is the most common, so that's people talking about 1, 2, different numeric values. We've also got a lot of dates recognized. We've got some organizations. We've got some WORK_OF_ART, and I'm guessing those are probably not actually works of art.
Let's take a look at what products are recognized. Since we are looking at product reviews, you would think there'd be a lot. What I'm going to do is subset my DataFrame, or filter it down, to the rows where a product has been recognized, and then all I'm doing here is the value counts to see how many of each product we are seeing. So let's take a look. We've got "All-Clad", "coarse sea salt", "Cuisinart", and it looks like a lot of things that don't make sense. Like I said earlier, this is model-based, so it was trained on different data and not on these product reviews. If we had enough data that we had annotated ourselves, we could definitely train a model to potentially do a better job at recognizing entities. But getting labeled data usually takes a lot of time, or you have to spend money to get someone to label the data for you. So sometimes you just have to work with what you've got and filter out the results from there.
We can change this to look at other entities that it recognized. For example, let's take a look at the people that it recognized. It'll be interesting to see if reviews are calling out people. It actually looks like it's not people being recognized here, though; there are a lot of things that don't make sense. I'm going to try WORK_OF_ART just for fun. I'd be highly surprised if there are works of art being mentioned in home and kitchen reviews. It's funny, because you can see where spaCy might get confused here.
"Joy of Cooking" by Rombauer and Becker, 1975: maybe that is a book or something. I'm actually not sure if that's real or not, but you can see that some of these do make sense as something that might be a work of art. Let's take a look at ORG as well and see what organizations were mentioned. Amazon, amazon.com. That definitely makes sense, since these are Amazon reviews, as I would expect. Now notice, with these dots, that a lot of these are getting cut off. You could definitely save this off as another DataFrame and print out the whole DataFrame if you wanted to get all of the results. Feel free to go through any of these and look at what entities are being recognized; you might be surprised what's in there.
12. Part of Speech Tagging and Dependency Parsing: Part-of-speech tagging is a way to identify the part of speech of each token. If you remember back to maybe freshman English class, when you had to learn all of the different parts of speech and diagram sentences, all that fun stuff you probably forgot, you're going to need to remember it when you start talking about part-of-speech tagging. The basic parts of speech are things like nouns, verbs, and adverbs, but there are a ton of different parts of speech out there. Based on where a token is within a sentence, you can identify what part of speech it is.
Now you might be wondering, why would I ever need to know the part of speech of different text? Well, it's often used behind the scenes to lemmatize tokens as well as to do named entity recognition. But a cool use case is translation. If we ask the computer to translate the following sentence into Spanish, "Can you throw this can in the trash?", we would probably need to use part-of-speech tagging to recognize that the first instance of "can" is a verb in order to translate that correctly.
And then the second instance of "can" is a noun. Dependency parsing is similar to part-of-speech tagging. If you're going to be doing this in natural language processing, you're probably going to need to go back to freshman year English and remember when you diagrammed sentences to identify where the root of a sentence is, what the noun subject is, what the modifier is, all of that stuff. But really, what dependency parsing is doing is analyzing the structure of a sentence based on how the words relate to each other or are dependent upon one another. This is often used behind the scenes for lemmatization and named entity recognition, as well as other tools, to identify the relationships between words or tokens.
Last but not least, we'll be touching on part-of-speech tagging and dependency parsing. First, I'm going to go out to this link, which has a bunch of parts of speech from Wikipedia. If you need a refresher, feel free to come out here and take a look. In English, or Western languages, we have a few of these different parts of speech: noun, verb, participle, article, pronoun, preposition, adverb, conjunction. But depending on what language you're working with, it might be different for you. Like I said in the slides, part-of-speech tagging is usually used behind the scenes to get named entity recognition to work, but I'll give you an example of where we might use it in the real world.
First, let's see how we do this with spaCy. I'm taking one of the example reviews that we've worked with so far, this ID, and assigning that to the doc variable. In there I am printing out the token text, the token part of speech, and the token dependency, so this also gets into dependency parsing at the same time. You can see this is kind of hard to read. We're seeing that "These" is labeled as DET, with the dependency det. We've got "things", which is a noun and the noun subject.
You can scroll through the entire review and get all of those, and it's kind of hard to read. So spaCy does have a cool way of visualizing its dependency parsing and part-of-speech tagging, which makes this little tree. It definitely takes you back to when you had to diagram sentences, and it shows you how all of these words are related to each other.
One example of where we might use this in the real world is counting up how many adjectives and adverbs people are using in their Amazon reviews, to see how descriptive their reviews are. Then potentially we can see which are the most descriptive reviews and which are the least descriptive. I'm going to take an example document that we've been using throughout this course and save that off as doc. Just to give us an idea of what this might look like, I'm only going to print the tokens that are ADJ, for adjective, or ADV, for adverb. You can see that earlier, if I scroll back up here, we had a ton of tokens printed out, because we were printing literally every part of speech and dependency that spaCy recognized. Here we've just limited it to the adverbs and adjectives.
That's just an example for one Amazon review. If we wanted to apply this to the entire DataFrame, what we could do is make a function; I'm calling it count_adverbs_adjectives. It takes in a spaCy Doc and returns an int, the count of the adverbs and adjectives. This way we can see which are the most descriptive texts. What I'm doing here is setting up a counter that's equal to 0 and then taking pretty much the code that we wrote earlier, except that instead of printing, I add one to the counter if the token is an adjective or an adverb, and then return the number of adjectives and adverbs in the text. Once that is run, we can use a lambda function, like we've been doing throughout the course, to apply that to the DataFrame and save that off as a new column called count_adjectives_adverbs.
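The counter function described above can be sketched as follows. The function name follows the spoken name in the lesson, so its exact spelling is an assumption, and the tagged sentence is made up, with stand-in token objects so the example runs without spaCy. The function only relies on each token having a `pos_` attribute, which is where spaCy tokens store their coarse part-of-speech tag.

```python
from types import SimpleNamespace


def count_adverbs_adjectives(doc) -> int:
    """Return how many tokens in doc are tagged ADJ (adjective) or ADV (adverb)."""
    counter = 0
    for token in doc:
        if token.pos_ in ("ADJ", "ADV"):
            counter += 1
    return counter


# Made-up sentence with hand-written part-of-speech tags.
tagged = [
    ("These", "DET"), ("pans", "NOUN"), ("are", "AUX"), ("really", "ADV"),
    ("sturdy", "ADJ"), ("and", "CCONJ"), ("surprisingly", "ADV"),
    ("light", "ADJ"), (".", "PUNCT"),
]
doc = [SimpleNamespace(text=text, pos_=pos) for text, pos in tagged]

# Print only the descriptive tokens, then count them.
print([token.text for token in doc if token.pos_ in ("ADJ", "ADV")])
# ['really', 'sturdy', 'surprisingly', 'light']
print(count_adverbs_adjectives(doc))  # 4
```

A raw count grows with review length, so dividing by the total number of tokens would be one way to compare how descriptive short and long reviews are.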
And then let's do a plot to see how many adjectives or adverbs people are generally using in their text. You can see there are, again, always going to be some outliers. It looks like the one outlier has 125 adjectives and adverbs, so that must be really descriptive text. It's probably a really long piece of text, and we'll take a look at what that looks like. Judging by this chart, people are generally in the 1 to 10 range of adjectives or adverbs, which probably makes sense for our dataset.
Finally, what we can do here is take a look at the one that was our outlier. But again, you could go through any of these: replace 125 with, say, 24, and it will bring back all of the reviews where there were 24 adjectives or adverbs. You can see that the one that had 125 adjectives and adverbs was the same example we had earlier, the one with a lot of tokens, something like 700 for the unclean version and then 200 for the clean version. So generally, you would probably think that as text gets longer, you would have more and more adjectives and adverbs being used, and more description of the product as well. But you can also test that hypothesis out yourself.
So that was natural language processing using Python in a nutshell. We covered some preprocessing techniques, term frequency, part-of-speech tagging, named entity recognition, and dependency parsing, which was definitely a lot to go through. And this is just the tip of the iceberg for natural language processing. There's a lot of stuff out there that we didn't cover today, so be on the lookout for future videos. We hope you enjoyed this video today. Thanks so much for joining us and learning about NLP.
