Twitter sentiment analysis & Natural language processing (NLP) for beginners | Engineering Tech | Skillshare

Twitter sentiment analysis & Natural language processing (NLP) for beginners

Engineering Tech, Big Data, Cloud and AI Solution Architec

Lessons in This Class

8 Lessons (33m)
    • 1. Introduction (1:11)
    • 2. Converting text to numeric values using bag-of-words model (4:31)
    • 3. tf-idf model for converting text to numeric values (4:11)
    • 4. NLP core and Building a text classifier (10:07)
    • 5. Applying for a Twitter developer account (2:21)
    • 6. Twitter sentiment analysis using the text classifier (5:37)
    • 7. Creating a text classifier using PyTorch (3:32)
    • 8. Creating a text classifier using TensorFlow (1:43)

13 Students

About This Class

On average, about 500 million tweets are posted per day! People tweet about topics ranging from politics and sports to movies and almost everything else under the sun. Sentiment analysis is the process of determining whether a piece of text (a review, tweet, feedback, etc.) is positive or negative. It helps us gather customer feedback on products or services, and it is used to gauge the general mood of the public on day-to-day affairs. Sentiment analysis can also be used to predict election results.

In this course you will learn the following:

  1. Converting text to numeric values using bag-of-words and tf-idf models
  2. NLP core techniques - stop words, stemming, tokenization
  3. Building a text classifier using Machine Learning classification techniques
  4. Exporting and deploying the Machine Learning models
  5. Setting up a Twitter developer account
  6. Fetching real-time tweets from Twitter and predicting sentiment

Prerequisites:

You should have prior knowledge of Python and of basic machine learning techniques such as classification.

Meet Your Teacher

Engineering Tech

Big Data, Cloud and AI Solution Architec

Teacher

Hello, I'm Engineering.

Transcripts

1. Introduction: Welcome to this Twitter sentiment analysis course. In this course we'll be fetching real-time tweets from Twitter and predicting the sentiment of tweets using natural language processing and Python machine learning techniques. We'll first understand the classification techniques and build a text classifier that can read any text and predict whether the sentiment is positive or negative. Once that is done, we'll deploy the models for Twitter sentiment analysis. This course is designed for someone who already knows Python machine learning and wants to understand how to do text classification and apply various NLP techniques to do Twitter sentiment analysis. If you're completely new to Python and machine learning, you may want to check out our other course, which is designed for absolute beginners. So let's dive in and get started.
2. Converting text to numeric values using bag-of-words model: All machine learning models are designed to work on numerical data. If you have numeric data, such as the age and salary shown here, then we can easily build a machine learning model that can predict the output for a new set of data. Now how do we apply that technique to classify text? For example, we could have review data for a restaurant, like "service is good" or "ambience is really nice", and we categorize them as positive or negative reviews. If we're able to build a classification model based on this review data, then we can predict whether a new review, for example "main course was nice", is good or bad. The problem that we need to solve is how to convert this text data to numerical format. This takes us to natural language processing, or NLP. It's an area of computer science that deals with the interaction of computers and human languages. NLP can be used to process text or speech.
One of the ways to convert text to numeric format is by using the bag-of-words model. You represent text as a bag of words, disregarding the grammar and the order in which the words occur but keeping the multiplicity: you give higher weightage to a word if it occurs more times in a particular sentence. Let's understand bag-of-words through a simple example. We have three sentences: "service good", "nice ambience", and "good food". Now let's see how we can represent them in numeric format using the bag-of-words model. Let's identify all the words appearing in all three sentences. These are service, good, nice, ambience, and food. Now let's see how many times each word occurs in each of the sentences. In the first sentence, service occurs once, so let's capture 1. Nice doesn't occur in the first sentence, so let's capture 0. Similarly, you can do that for all the words in all three sentences, and then you can create a matrix of numeric values. Let's look at a slightly more complex example. We have three sentences, and these sentences have many words, as shown here. The first sentence is "service is good today", the second is "ambience is really nice", and the third is "today food is good and salad is nice". We'll create a histogram of words and capture how many times each word occurs. When you convert a sentence to numeric format, you do not necessarily take all the words. You find the top words and then create a matrix out of those. There are various libraries available that let you pick the top 1,000 or 10,000 English words for your text and create a numeric vector. For now, let us try to understand how the model is created by taking these simple examples and picking the top four or five words. When you start working on an actual NLP project, you'll have libraries to help you extract the words and create numeric vectors. In this particular case, we have arranged the words by word count, and we'll pick the top five.
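The two examples above can be reproduced with a few lines of plain Python. This is only a minimal sketch, not the code used in the lesson: the function name `bag_of_words` and its `top_k` parameter are made up for illustration, and ties between equally frequent words are broken by first appearance, which is an assumption.

```python
from collections import Counter


def bag_of_words(sentences, top_k=None):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in sentence i."""
    tokenized = [s.lower().split() for s in sentences]
    totals = Counter(word for words in tokenized for word in words)
    # Stable sort: most frequent words first, ties keep first-seen order.
    vocabulary = sorted(totals, key=lambda w: -totals[w])
    if top_k is not None:
        vocabulary = vocabulary[:top_k]
    matrix = [[words.count(w) for w in vocabulary] for words in tokenized]
    return vocabulary, matrix


# The simple example: every word is kept.
simple_vocab, simple_matrix = bag_of_words(["service good", "nice ambience", "good food"])

# The more complex example: keep only the five most frequent words.
top_vocab, top_matrix = bag_of_words(
    [
        "service is good today",
        "ambience is really nice",
        "today food is good and salad is nice",
    ],
    top_k=5,
)

for row in top_matrix:
    print(dict(zip(top_vocab, row)))
```

Each row of the matrix is the numeric vector for one sentence. In the third row, the count for "is" is 2 because the word occurs twice in that sentence.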
The five words that occur most often are is, good, nice, today, and service, and we use them to build a numeric vector for our three sentences. As you can see here, the word "is" occurs twice in the third sentence, so that's why the value is 2 here. For the rest of the sentences it occurs once, so we've captured 1 here. Similarly, the count of the number of times each word occurs in each sentence is captured here. The limitation of the bag-of-words model is that each word is given the same importance. If you have to do some analysis using text, for example if you have to calculate the sentiment of the text, not all words have the same importance. For example, words like "nice" will have higher importance than "today" when it comes to positive sentiment analysis. Let's now look at another technique with which we can give higher importance to certain words.
3. tf-idf model for converting text to numeric values: TF-IDF is a popular technique to convert text to numeric format. TF-IDF stands for term frequency and inverse document frequency. As per this model, if a word occurs more times in a document or a sentence, it is given more importance. However, if the same word occurs in many sentences or many documents, then the word is given less importance. Let's look at an example. TF is term frequency, that is, the number of occurrences of a word in a document divided by the number of words in that document or sentence. For example, take the sentence "today food is good and salad is nice". The term frequency of the word "good" is 1/8, because the word "good" occurs once and there are eight words in total. Similarly, the term frequency of the word "is" is 2/8, because the word "is" occurs twice and there are eight words in total. So going by this model, the word "is" would have higher importance than the word "good", because it occurs more times in this particular sentence.
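The term-frequency arithmetic above is easy to check in code. The helper below is a hypothetical illustration (the name `term_frequency` is not from the lesson) and assumes words are separated by single spaces.

```python
def term_frequency(word, sentence):
    """Occurrences of `word` divided by the total number of words in `sentence`."""
    words = sentence.lower().split()
    return words.count(word) / len(words)


sentence = "today food is good and salad is nice"
tf_good = term_frequency("good", sentence)  # one occurrence out of eight words
tf_is = term_frequency("is", sentence)      # two occurrences out of eight words

print(tf_good, tf_is)
```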
However, if the word "is" is a common word that occurs across multiple sentences or documents, its importance would be lower. That is driven by inverse document frequency, which we'll look at next. IDF, inverse document frequency, is calculated based on this formula: log base e of the number of sentences divided by the number of sentences containing the word. Again, you don't have to remember this formula. You'll have libraries available to calculate TF and IDF values. For now, understand the concepts. Let's look at a simple example to understand IDF. Imagine we have three sentences: "service is good today", "ambience is really nice", and "today food is good and salad is nice". We already know how to calculate the term frequency of the different words appearing in these sentences. Now, to calculate inverse document frequency, we take log base e of the number of sentences, which is three for all the words, divided by the number of sentences containing the word. For example, "is" appears in all three sentences, so in the denominator we have three, and log base e of 3/3 is 0. The word "is" will have lower importance because it is a commonly occurring word. Similarly, the word "good" occurs in two documents. If we apply log base e of 3/2, we get a low value of about 0.41. And then we can calculate this for all the words. "Service" occurs in only one sentence or document, so its value is about 1.10. To calculate the numeric value of each word, we take into account both TF and IDF: simply multiply TF and IDF. For example, for the word "is" in the third sentence, TF is 0.25 and IDF is 0, so the TF-IDF value is 0. Similarly, you can calculate the TF-IDF value for all the words. Now you can see that words are given importance based on how many times they occur in a sentence and how many times they occur across all the sentences. Unlike the bag-of-words model, we give more importance to words that occur more times in a particular sentence but are less spread out across sentences. This is the TF-IDF model, with which you can convert text to numeric format.
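The IDF and TF-IDF values quoted above can be verified with a short sketch. It follows the plain formula from this lesson, log base e of (number of sentences / number of sentences containing the word); the function names are illustrative assumptions, and the word is assumed to occur in at least one sentence.

```python
import math


def inverse_document_frequency(word, sentences):
    """ln(total sentences / sentences containing the word)."""
    containing = sum(1 for s in sentences if word in s.lower().split())
    if containing == 0:
        raise ValueError(f"{word!r} does not occur in any sentence")
    return math.log(len(sentences) / containing)


def tf_idf(word, sentence, sentences):
    """Term frequency within `sentence` multiplied by IDF across `sentences`."""
    words = sentence.lower().split()
    tf = words.count(word) / len(words)
    return tf * inverse_document_frequency(word, sentences)


sentences = [
    "service is good today",
    "ambience is really nice",
    "today food is good and salad is nice",
]

idf_is = inverse_document_frequency("is", sentences)            # ln(3/3)
idf_good = inverse_document_frequency("good", sentences)        # ln(3/2)
idf_service = inverse_document_frequency("service", sentences)  # ln(3/1)
tfidf_is = tf_idf("is", sentences[2], sentences)                # 0.25 * 0

print(round(idf_is, 2), round(idf_good, 2), round(idf_service, 2), tfidf_is)
```

Library implementations may use a slightly different formula. For example, scikit-learn's `TfidfVectorizer` smooths the IDF by default, so its numbers will not match this hand calculation exactly.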
Once you have this text in numeric format, it can be fed to a machine learning model. Each of these words in a text-based classification system would be a feature, or independent variable, and your dependent variable would be whether the sentiment is positive or not. That can be represented in numeric format as 1 or 0 instead of positive or negative.
4. NLP core and Building a text classifier: Let's understand how to build a text classifier using the techniques that you've just learned. We'll also understand some of the core concepts of NLP, or natural language processing. Go to Google Colab and create a new notebook. We'll call it text classifier. There are various libraries available for natural language processing. We'll be preprocessing our text using a popular library called NLTK. We'll understand NLTK and some of the core concepts of natural language processing by looking at some examples. First, we need to import NLTK. After that, we need to download the NLTK libraries, and we'll download all of them. While it is downloading, let's look at the text file that we'll be working on to understand NLP and build a text classifier. We'll be looking at this restaurant review data, which is available on Kaggle and many other places online. It contains restaurant reviews and whether customers liked the restaurant or not: 1 means they liked it, and 0 means they did not. You can see some positive sentences, which are marked as 1. "Would not go back" is a negative review, so it is marked as 0. Based on this data, we'll have to build a text classifier with which we can predict whether a new sentence is positive or not. We'll click on the tab to get the path of this file. We need pandas to load the file, so we'll first import numpy as np, then pandas as pd. Using pandas read_csv, we'll read this CSV from our GitHub repository.
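The review file is tab-separated, with the review text in one column and the 1/0 label in the other, so the reader has to be told about the delimiter and about how to treat quote characters. The sketch below uses only Python's standard `csv` module on a made-up two-row sample; the column names `Review` and `Liked` and the first sample row are assumptions for illustration. With pandas, the corresponding call would be along the lines of `pd.read_csv(path, delimiter="\t", quoting=3)`, where 3 is the value of `csv.QUOTE_NONE`.

```python
import csv
import io

# Hypothetical sample in the same layout: review text, a tab, then the label.
sample = 'Review\tLiked\nThe "special" was great\t1\nWould not go back\t0\n'

# QUOTE_NONE (numeric value 3) tells the reader to treat quote characters as plain text.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
header, *rows = list(reader)

reviews = [text for text, _ in rows]
labels = [int(liked) for _, liked in rows]

print(header)
print(reviews)
print(labels)
```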
We got an error because this file is not comma-separated but tab-separated, so you have to specify the delimiter. The delimiter would be tab, and we also set quoting equal to 3, which means double quotes should be ignored. Once it is loaded into a pandas DataFrame, we can see the top records. Now the restaurant review data is loaded into a pandas DataFrame. In natural language processing, we remove some of the commonly occurring words, since they might not tell us whether a sentence is positive or negative but would still occupy space. Those words are called stop words, and using NLTK we can easily get rid of all the stop words. There is another concept called stemming, with which we can derive the root form of words. For example, for both "running" and "run" we can have the word "run", and for "totally" and "total" we can have the word "total". That way, we limit the number of words in our analysis. Let's understand how that would work. First, we'll import the stopwords library from NLTK. Then we'll import the Porter stemmer, with which you can derive the root form of words, and we'll instantiate the stemmer class. Now let's look at our dataset in detail. It has 1,000 entries. We'll have to loop through these thousand entries, remove all the stop words, apply stemming, and create a corpus of clean text. First we'll declare an empty list, which will contain the corpus of text. Now, for i in range 0 to 1,000, we'll declare a customer review variable that will contain the data for each row, which we can fetch from the review column of the dataset at index i. Next, we'll get rid of all the stop words and apply stemming using this syntax. We get all the words in the customer review, and if a word is not in the English stop word list of the NLTK library, we apply stemming. Then we concatenate the words to get the sentence back. And finally, we append that to the corpus list. We'll also do some further data cleaning.
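A minimal sketch of the cleaning loop, including the letters-only filtering and lowercasing that are part of the same code. To keep it runnable without any downloads, it uses a tiny hand-written stop word set and a very crude suffix stripper; both are stand-ins invented for this illustration. In the lesson these roles are played by NLTK, that is, `nltk.corpus.stopwords.words('english')` and the `stem` method of `nltk.stem.porter.PorterStemmer`, which can be passed in through the `stop_words` and `stem` parameters. The sample reviews are also made up.

```python
import re

# Stand-in stop words (an assumption; NLTK's English list is much longer).
TOY_STOP_WORDS = {"a", "an", "and", "in", "is", "it", "the", "this", "was"}


def toy_stem(word):
    """Crude suffix stripper used only as a placeholder for a real stemmer."""
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_review(text, stop_words=TOY_STOP_WORDS, stem=toy_stem):
    """Keep letters only, lowercase, drop stop words, stem, and rejoin."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(stem(w) for w in words if w not in stop_words)


# Hypothetical reviews standing in for the rows of the dataset.
reviews = ["Loved the food, totally worth it!!", "This place is amazing."]
corpus = [clean_review(review) for review in reviews]

print(corpus)
```

As the output shows, a stemmed root such as "lov" or "amaz" is not necessarily a real word; the point is only to map related word forms to one token.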
If we look at this review, there are certain characters, like the exclamation mark, which we can also get rid of using a Python regular expression. We'll keep only alphabetic characters, in lowercase or capital letters, and you can easily do that in Python using a regular expression. The syntax for that is something like this. This should get rid of all the characters that are not alphabetic, and we'll also convert all the sentences to lowercase for consistency. Now we'll split the sentence on spaces to derive the words. So the first line removes all the junk characters. Then we convert the sentence to lowercase and split it by space. For each word, if it is not in the stop words, we take that word and apply stemming. And then finally we join all the words to get the sentence back. So let's run it and see the output. We need to import the regular expression module also. This has to be lower. Now after this, we should have a corpus of clean sentences. Let's check out the values. We'll take the first sentence. As you can see, all the dots have been removed and the entire sentence has been converted to lowercase. Let's check out line seven, which is at index six. You can see the parentheses have been removed, and all the stop words like "a", "in", "the" and other commonly occurring words have been removed. And the stemmer helped us derive the root form of each word. Let's look at another example. This is another sentence where words have been changed to their root form. Note that the root form may or may not have any meaning, but it helps us reduce the number of words so that we can do the processing much faster.
Next, let's convert the sentences to numeric format using the TfidfVectorizer. Scikit-learn has a TfidfVectorizer class, and we can specify how many words we want, say 1,500 or whatever number. Using min_df, we specify the minimum number of documents in which a word should occur for it to be considered, so you can get rid of words that occur infrequently. Using max_df, you can get rid of words that occur frequently in all the documents. For example, a max_df of 0.6 would get rid of any word that occurs in more than 60% of the documents. Next, using the vectorizer, we can convert the corpus to a numeric array. Let's print X now. These are the TF-IDF values. There will be some nonzero values that are not displayed in this notebook. Let's check a sample record, and we can see that some of the words have nonzero values. So this vectorizer has created a two-dimensional numeric array from all the sentences in the restaurant review file. In this dataset, Liked is the dependent variable, which contains 1 or 0. So let's create a dependent variable, y, which will have the data for this column. We'll get all the rows and the second column and convert that to a NumPy array. When you print y, you can see all the values, 1 or 0. After this, the steps to create a machine learning model are the same as what we have seen earlier for numeric data. We'll do a train-test split, keeping 80% of the data for training and 20% for testing. Let's use the k-nearest neighbors technique to build a classifier. You can also use any other classification technique, such as naive Bayes, which is a popular classifier for text-based data. Now let's predict using the classifier. Let's derive the confusion matrix. We'll now print the accuracy score. Next, let's have a sample sentence and predict whether it is positive or negative. We use the same vectorizer to convert this sentence to numeric format. So this is now the TF-IDF representation of the sentence. After that, we can predict the sentiment using the predict method of the classifier. We got 1, which is positive. Let's have another sample sentence and convert that to TF-IDF format. Now predict the sentiment, and we got 0, so this is a negative sentence. This is how we can build a text classifier that can read different sentences and determine whether each is positive or negative.
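The effect of the min_df and max_df settings mentioned earlier in this lesson can be illustrated without any library. The sketch below is not scikit-learn's implementation; it only mirrors the meaning described above, assuming an integer `min_df` is a minimum document count and a fractional `max_df` is a maximum share of documents. The function name, the default values, and the sample documents are all made up for illustration.

```python
from collections import Counter


def filter_vocabulary(documents, min_df=2, max_df=0.6):
    """Keep words found in at least `min_df` documents and at most `max_df` of all documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document, however often it repeats there.
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        word
        for word, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )


docs = [
    "food good",
    "food bad",
    "food good service",
    "food slow service",
    "food nice",
]
kept = filter_vocabulary(docs)

print(kept)
```

Here "food" is dropped because it occurs in every document, and "bad", "slow", and "nice" are dropped because each occurs in only one document.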
Now if anybody wants to predict using this classifier, they would need the classifier, and they would also need the vectorizer. Let's export these two to files in pickle format. This is our classifier; we'll call it text classifier. And we'll create a pickle file for the TF-IDF model. Now we have both pickle files, and we can download them from the Colab environment and take them to another environment where we can use these pickle files to predict the sentiment of text.
5. Applying for a Twitter developer account: Let's go to developer.twitter.com and apply for a developer account. This is different from the twitter.com account that you might have. First you need to log in to Twitter and then go to developer.twitter.com. Click Apply. Click on Apply for a developer account. I'll select doing academic research. Give all your details. Specify the reason for creating a developer account, which will give you access to data. Answer the various questions. Click Next. Read the terms and conditions and click Accept, and submit the application. You need to go to your mailbox and confirm that you've applied. Twitter will then review the application and approve it. It might take a few hours or up to a few days. You will get an email saying that your application has been submitted for review. Once your application is approved, go to developer.twitter.com. Click on Developer Portal. Then here you can click on Apps, and you can create an app. Give it a name, give it a callback URL, which can be the same as your URL, and other details. Once the app is created, you can go to Keys and Tokens and get your consumer API key and secret key, which you can use to retrieve tweets. You can always go back to Apps, select a particular app, and go back to the Keys and Tokens tab to see the keys. You can also regenerate them: if somebody knows your keys, you can always regenerate them. And you can generate an access token and access secret.
You can see these values only once, so be sure to copy and keep them somewhere.
6. Twitter sentiment analysis using the text classifier: Let's now go to the text classifier notebook on Google Colab and download the pickle files that we generated in the previous lesson. First we need to import the files library. Then we can call files.download and specify the file name in quotes to download the pickle files. First we download the classifier, then we'll download the TF-IDF model. We'll upload the pickle files to a GitHub repository. Now let's create a new notebook for Twitter sentiment analysis. We'll save this and name it Twitter sentiment analysis. This is a new notebook, so the pickle files will not be present here. We'll copy them from the GitHub repository. Copy the link address, then first get the TF-IDF model; copy the link address, and then get the text classifier. Now both files have been copied. To do Twitter sentiment analysis from a Python program, we'll use the Tweepy library. First we need to import tweepy. Then we need to declare four variables to store the consumer key, consumer secret, access token, and access secret. Let's copy them from our developer account. We'll select the app that we just created and copy the key, secret, access token, and access secret. I will regenerate these keys after this lab, so you will not be able to use these keys. Next, we'll write the standard Tweepy code to get authorized to Twitter using the consumer key, consumer secret, access token, and access secret. Next we declare an API variable with a certain timeout; I have specified a 20-second timeout. If there is no tweet for 20 seconds, then it will time out. Next, let's fetch tweets for a particular text. We'll be fetching for "vaccine", which is a popular topic. Now we'll create an empty list to store all the tweets. Then, using standard Tweepy code, we can fetch all the tweets. The only thing that you need to pay attention to is how many tweets you want to fetch; I have specified 500 here.
This will keep running until it reaches 500 tweets. You can verify the number of tweets fetched, which is 500. You can check some sample tweets also. These are some real tweets that people are tweeting right now on the COVID vaccine. As you can see, the tweets have a lot of special characters, like the hash sign and the at sign. We can use Python regular expressions to cleanse the tweets. So we'll write a loop: we'll get the tweets one by one, convert them to lowercase, and remove all junk characters. You can read more about regular expressions to understand how to deal with different types of text. We can take a sample tweet after cleansing. Let's check out this one. See that it has been cleaned and all the special characters are gone. We have learned various techniques to deploy the pickle files, like having REST APIs or serverless APIs. For this lab, let's simply load the pickle files into two variables and use them. We import pickle, load the classifier into one variable, and load our TF-IDF model into another variable. Let's declare two variables to keep track of positive and negative tweets. Next we loop through the tweet list, and using the classifier's predict method we predict the sentiment of each tweet. Before feeding the text to the classifier, we have to apply the TF-IDF model to convert it to numeric format. Let's run this. After that we will get the positive and negative tweet counts. Let's see how many positive tweets there are on vaccine: it's 97, and then 403 negative tweets. So this is the sentiment of the text analyzed for the last 500 tweets.
7. Creating a text classifier using PyTorch: Let's now understand how to create a text classifier using PyTorch. If you are new to PyTorch or deep learning, you can check out our other course on machine learning and deep learning model deployment. The steps for text preprocessing and cleansing are the same as what we have done earlier. Once you have the corpus of clean text, you can use the TfidfVectorizer to create a numeric array.
After that you can do a train-test split using scikit-learn. Then, instead of creating a model using the k-nearest neighbors technique, we'll use PyTorch to build a text classifier. Import the required libraries for PyTorch. You need to convert the X and y variables to tensor format. One thing to note here is that we have a total of 1,000 sentences in the corpus, and they have 467 features. These are the vectorized words, and they determine our input node size. We'll have an input size of 467, because there are 467 words, or features, in this corpus of text. The output size would be two, because you are predicting whether the sentiment is positive or negative. We can try different hidden sizes; let me try 500. Similar to the previous example, we have two hidden layers, so we'll have three fully connected layers: input to hidden, hidden to hidden, and then hidden to final output. So the only change here is the input size and the hidden size. The rest of the steps were discussed earlier: define the model class, then define the optimizer and the learning rate. Let's use 100 epochs this time. Now let's train the neural network. You'll see the loss getting minimized. Now the model is trained and ready for prediction. We can predict the way we predicted earlier. Let's have a sample sentence and convert it to numeric format. Then we need to convert that sentence to torch tensor format. After that, you can predict using the PyTorch model class. From this output, we can see that it's a positive sentence, because the second element is higher than the first one. If we have another sentence, similar to the one that we had earlier, which is a negative sentence, then we'll get an output in which the first element is higher than the second one. So this is a negative sentence. Now you can export the dictionary and integrate it with the Twitter sentiment analysis program.
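Reading the two-element output described above comes down to picking the index of the larger score: index 0 for negative, index 1 for positive. The plain-Python sketch below shows that step on made-up scores; it does not build or run the PyTorch model. Applying softmax to turn the scores into probabilities is an extra assumption for readability and is not needed to decide the label.

```python
import math

LABELS = ["negative", "positive"]  # index 0 = negative, index 1 = positive


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def interpret(scores):
    """Return the label with the higher score and its softmax probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Hypothetical model outputs for two sentences.
label_a, prob_a = interpret([-1.2, 2.3])  # second element higher
label_b, prob_b = interpret([1.8, -0.7])  # first element higher

print(label_a, round(prob_a, 3))
print(label_b, round(prob_b, 3))
```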
If you are interested in learning how to deploy a PyTorch model and how to create REST APIs from your PyTorch model, you can check out our other course on machine learning and deep learning model deployment.
8. Creating a text classifier using TensorFlow: Now let us understand how to create a text classifier using TensorFlow Keras. Once our data is ready, we can create a TensorFlow model. Similar to the earlier examples, we'll create two hidden layers and one output layer. We'll have 500 nodes in each hidden layer. In TensorFlow Keras, you do not have to specify the input layer, because it will automatically determine that from the input data. Now let's train the model with 100 epochs. Once the model has been trained, we can check the loss and also check the model summary. Now we can predict the way we predicted earlier for the KNN or PyTorch models. Have a sample sentence and convert it to numeric format. Then, using the TensorFlow model.predict method, you predict the sentiment. It is 0.79, so it means it's a positive sentence. Similarly, for the other one, we've got a very low number, with an exponent of e-07, so that's a negative sentence. Now you can save and export this model and integrate it with the Twitter sentiment analysis program. If you are interested in knowing how to create REST APIs for TensorFlow models or how to deploy TensorFlow models, you can check out our other course on machine learning and deep learning model deployment. Thank you for enrolling in this course.