
Sentiment Analysis Using Python

Mayank Rasu, Experienced Quant Researcher

14 Lessons (2h 20m)
    • 1. Why Sentiment Analysis (6:00)
    • 2. Sentiment Analysis - Intuition (8:09)
    • 3. Natural Language Processing Basics (17:16)
    • 4. Lexicon Based Sentiment Analysis (6:36)
    • 5. VADER Introduction (11:35)
    • 6. Textblob Introduction (7:41)
    • 7. Building a Sentiment Analyzer using VADER - Part I (10:35)
    • 8. Building a Sentiment Analyzer using VADER - Part II (18:21)
    • 9. Machine Learning Based Sentiment Analysis (12:24)
    • 10. ML Feature Matrix & TF-IDF Introduction (9:50)
    • 11. Building a ML Based Sentiment Analyzer - Part I (3:16)
    • 12. Building a ML Based Sentiment Analyzer - Part II (16:59)
    • 13. Building a ML Based Sentiment Analyzer - Part III (6:31)
    • 14. Sentiment Analysis Application - Opportunities & Challenges (4:56)

About This Class

Build an automated system for analyzing text documents and determining their polarity. This course delves into the evolving area of sentiment analysis. It starts with the basics of sentiment analysis and natural language processing, and covers both the lexicon-based approach and machine learning-based methods of sentiment analysis. By the end of this course you will be conversant with popular Python libraries such as NLTK, VADER, TextBlob, and scikit-learn, and you should be able to build a sophisticated sentiment analyzer with reasonable accuracy.

You can expect to gain the following skills from this course:

  • Data mining using web scraping

  • Natural language processing basics

  • Lexicon-based sentiment analysis

  • Machine learning-based sentiment analysis

  • Using the VADER and TextBlob libraries to perform sentiment analysis

  • The Naive Bayes algorithm

  • Building a machine learning-based sentiment analyzer

Course image by courtesy of rawpixel.com

Heart vector created by rawpixel.com - http://www.freepik.com (https://www.freepik.com/free-photos-vectors/heart)

Transcripts

1. Why Sentiment Analysis: Hello, friends. In this section we are going to gain an understanding of sentiment analysis, which is also called opinion mining or polarity detection. What we are trying to do in sentiment analysis is use algorithms, or different methodologies, to extract the polarity of a given document; by polarity I mean whether the document carries a positive, negative, or neutral message. As human beings we can very easily understand human language, but for a machine to work out whether the sentiment of a statement is positive or negative is not that straightforward, so we have to use algorithms to achieve it.

Sentiment analysis is a field within natural language processing, or NLP. NLP is an area of massive growth; it is gaining popularity with every passing day and there is already a lot of research happening in it. It is a multidisciplinary field consisting of computer science, mathematics, and even linguistics. You see the applications of NLP in everyday life: if you use the well-known artificial intelligence assistants such as Alexa, Cortana, or Siri (my own Alexa just fired up in the background as I said that), you interact with them using human language, and the machine is able to understand what you said, decode it, and perform some action. That is what NLP does: it tries to figure out human languages using a set of algorithms, and sentiment analysis is a subset of NLP. NLP encompasses both machine learning and deep learning algorithms, and in this section we will also be delving into some machine learning approaches to sentiment analysis.

Sentiment analysis can be bifurcated into two broad approaches. One is the machine learning based approach, in which we use machine learning algorithms. The second is the lexicon based approach, which is a dictionary based or corpus based method. I will delve into both approaches in this course, and we will see how they can be used to figure out the sentiment inherent in a statement.

So why should we do sentiment analysis? It has gained a lot of popularity in industry because it allows organizations to mine customer opinion and act accordingly. There are engineers parsing tweets, Facebook posts, hashtags, and cashtags, trying to figure out what customers think about a given product or company, and companies orient their policies based on that. So it is fairly popular, a lot of people are already using it, and it is nothing new. It is also fairly inexpensive: by the end of this course you should be able to write a fairly sophisticated sentiment analysis script that can extract, or mine, opinion from a large pool of data. Earlier, companies would call people or conduct surveys to get opinions about something; now it can be done by mining data from social networking websites and other sources such as news sites, feeding that data into a sentiment analyzer, and getting the same information very cheaply. It also lets you understand the collective emotions of a group, which can lend you a massive competitive advantage, especially in politics. We all know the controversy surrounding the US election that Donald Trump won: the company Cambridge Analytica used very sophisticated sentiment analysis algorithms to understand voters and tune its messaging accordingly. So the importance of sentiment analysis cannot be overemphasized; it is an area growing in popularity. In finance, too, we see sentiment analysis being used in various ways in stock investing. It is still a relatively new area, and the success of sentiment analysis in stock trading strategies is still a matter of debate, but the idea is to give you an understanding of how sentiment analysis can be used in finance and how you can hopefully gain an advantage when you are preparing a trading strategy based on it.

2. Sentiment Analysis - Intuition: Hello, friends. In this lecture video I'll try to help you build an intuition for how an algorithm approaches the task of understanding the sentiment of a language or statement it has no idea about. Let's set up a problem you will hopefully never face. Say you want to impress one of your German friends, who happens to be a movie buff. You go to a German movie review website and look at a review of a movie written in German. You do not understand a word of German, and your task is to surprise your friend by going through this review and figuring out whether it is positive or negative, that is, whether your friend should watch the movie or not. The rule is that there is no Google Translate, no German-to-English dictionary, and no interpreter. You may think that, living in civilization, you could get at least one of these, but the point is to make you appreciate the predicament a machine faces: it has no understanding of human language, and we throw these big documents at it and expect it to tell us whether their sentiment is positive or negative. So how would you go about doing that? At this point I want you to pause the video and think about how you would approach the problem.

All right, here is one way to approach it, and it is a fairly imperfect way. You have the website with all the German-language reviews, so why not go through a lot of the other German reviews on that site
and try to identify some of the keywords? For example, looking at this review, a few long words stand out as important (I apologize to any German students for my pronunciation; I do not mean to offend). There are some small connector words in every sentence, and then there are some really big words that look like the most significant words in the sentence. So why not look at other reviews, check whether those keywords appear in them, and see whether those reviews were rated positively or negatively? Again, this is not a perfect approach, but it is one approach you might think of.

So, for example, I was able to find some other reviews containing these keywords on that German movie website, and this is what I came up with. A review containing the first keyword was rated 3.5, so it looks like that is not a good word. Another keyword appeared in a review rated 2.8, again not a good sentiment for that word. Then there is "schön": I do know what that means from my little German, it means beautiful, and it appeared in a review rated 7.4, so that looks like a positive word. The remaining keywords turned up in reviews with similarly unimpressive ratings, one of them around 4. All in all, if you look at the ratings of the other reviews containing these keywords, it looks like this is not a good movie on the whole, because the combined picture is pretty sad. So with an average level of confidence I can say that this review is not good, which means your German friend should not watch this movie. Now let's check Google Translate; I broke the rule and put the review into Google Translate, and it is in fact not a good review: it starts with a positive statement but ends on a negative note.

What I wanted to do with these three slides was build an intuition for how you might approach a problem when you do not understand the language, because that is exactly what machines are subjected to: they do not understand the language, and we ask them to tell us what the sentiment is. As I mentioned in the previous lecture video, we will be looking at both the lexicon based approach and the machine learning based approach, and I will be using several libraries as we go along. You may already be a bit annoyed that I have been talking a lot and have not yet shown you a code snippet, but don't worry, there will be a lot of code to discuss in this sentiment analysis section. It is very important that you have a fairly good intuition and understanding of what sentiment analysis is trying to do; if you understand the building blocks, you will very easily be able to build a fairly good sentiment analyzer.

The relevant Python libraries I'll be using are, first, NLTK. NLTK is probably the most popular Python natural language processing library. It can perform tokenization, stemming, lemmatization, punctuation handling, character counts, word counts, and so on.
Don't worry if you do not understand what these terms mean; these are some of the things we need to do whenever we are doing NLP, and I'll take you through all of these concepts in the upcoming videos. Then there is a very good sentiment analysis library called VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner. It was created by professors at Georgia Tech, and I'll point you to the research paper that describes the method and its rule-based approach to analyzing sentiment. Then we will be using scikit-learn. scikit-learn is used for machine learning models, and we will use it to implement the machine learning based approach to sentiment analysis. It is expected that you have some understanding of machine learning; I'll try to give you the basics, but it will be immensely helpful if you have some machine learning background. Finally, TextBlob is also a very widely used sentiment analyzer available in Python. TextBlob can perform both machine learning based sentiment analysis and lexicon based sentiment analysis, which is why it is pretty popular. In the next videos I'll take you through the lexicon based approach: I'll start with the basic concepts that go into building a lexicon based sentiment analyzer, and then we'll discuss a code snippet.

3. Natural Language Processing Basics: Hi, friends. The first step in performing any text analysis is vectorization. You know that computers require data in particular data structures; for example, to perform computational analysis you need to provide data in formats such as integers, floats, lists, or dictionaries in Python. Vectors and matrices are especially popular in this regard: we use them for a wide variety of computational analysis, and most of the machine learning analysis you will come across is based on matrix and vector representations of numbers. So vectorization is the first, and probably the most important, step in performing textual analysis.

There are several steps to perform when you are vectorizing text. The first is tokenization, which means splitting the text into relevant units, for example characters, words, or phrases. Every single split entity is called a token, and we usually tokenize by word. For example, if you have a paragraph of 40 words separated by white spaces and you tokenize it, you get a vector, or list, of 40 individual words. That is tokenization.

Then there are lemmatization and stemming. Both are very similar concepts: they reduce words to their root words, removing the inflectional forms. For example, "study", "studied", and "studying" all share the same root word, "study", so all three would be reduced to it. The important thing about lemmatization is that the root word needs to be an actual English word, which means some analysis has to have been done beforehand to tell the computer what the root words are. That is not easy, because only certain English lexicons provide these root words, and that is why we also use stemming. Stemming is a rule-based reduction towards the root. For example, we know that in English we add the suffix "ed" to represent the past tense (hurry, hurried), and we add "s" or "es" to represent different word forms. When you build a stemming algorithm you encode these kinds of rules, and whenever the algorithm sees a word with a trailing "es", "ed", and so on, it cuts off that suffix and keeps the stem. As you can imagine, the stem you get may not be an actual English word, because the rules are applied blindly to every word; a word may end in "ed" without being a past tense at all. So stemming is a lazy way of implementing lemmatization, but both are trying to reduce words to their root forms, and I'll show you an example shortly.

The fourth step is stop word removal, which is very important and which I use a lot. In any language there are many connector words, such as "the" and "is", which do not really carry meaning but simply bind the sentence together. We do not want to spend computational resources assigning vector space to these connector words, so we remove these stop words from our analysis before creating the vector. The fifth, fairly common, step is normalization: cleaning the data of things like emoticons or extra punctuation marks, which may appear in a document but may not really mean much. Once you perform all these operations, you get a vectorized output of the document, which is a vector of words.

Let me show you in a Python script how we can perform these steps to convert a text document into a vector of words. The library I'll use is NLTK, the Natural Language Toolkit, probably the most used Python library for natural language processing. Most of the other libraries we will discuss use NLTK under the hood; I am not sure about VADER, but TextBlob definitely does. NLTK has a slew of functions and modules, some of which we will experiment with right now, that let you perform text analysis in a very convenient way. Installing it is straightforward: just run pip install nltk (with an exclamation mark in front if you are running it inside a notebook or IDE); I already have it, so I am not running that cell.

The text I am trying to vectorize is: "I am not a sentimental person but I believe in the utility of sentiment analysis", which is a pretty apt statement for this lecture. The first step is tokenization, which simply splits this string by white spaces. The module in NLTK that we use is nltk.tokenize, and within it we use word_tokenize; there is also sent_tokenize if you have a paragraph with multiple sentences and want to tokenize by sentence, but typically we tokenize by word. So we import word_tokenize, pass the text to it, and get the tokens. If you run it, you see that it splits the sentence by white spaces, and "tokens" is now a list of all the words that were in my sentence.

Second is lemmatization. Remember, lemmatization is when we reduce all the words to their root word, the root word being an actual English word. The module from NLTK is nltk.stem, which is used for both lemmatization and stemming. For lemmatization we use the WordNetLemmatizer, which relies on the WordNet lexicon; WordNet contains the morphology of the English language, which means it has the root words for just about any word you can think of. If I run this, it fails: I was able to create the lemmatizer object, but once I try to lemmatize my tokens I get an error saying "Resource wordnet not found". WordNet is like a dictionary and you have to download it via the NLTK downloader. I do not have it because I never actually use lemmatization, and in fact I do not use stemming either, for reasons I'll explain. If you do download WordNet and run it, you would see that words like "am" become "be", and so on. In this particular sentence it would not have made a lot of changes, because most of the words are already root words rather than inflected forms, but you get the drift of what lemmatization would have done.

Stemming, on the other hand, works out of the box. We use the PorterStemmer module to perform it. For stemming, first convert all the text to lowercase, because if a word is capitalized the stemmer treats it as a new word and the stemming will be off, so it is a good idea to lowercase your text before applying stemming. Then you initialize a PorterStemmer object, say ps, and pass your tokens to the stemmer, which gives you the stemmed words. If I run this, you can see that some of the words have changed. "Sentimental" became "sentiment", which is what we expected. For "believe" it removed the trailing "e", so one of the rules encoded in the stemmer apparently trims a final "e" in that situation. For "utility" it removed "ity", because in English we form many words by adding "ity", for example "chaste" and "chastity" or "naive" and "naivety", so that suffix gets stripped to arrive at the stem. For "analysis" it removed the "s", presumably treating it as a plural. So you can see that stemming may not always give you the correct or intended root word reduction; it is a lazy approach that applies a fixed set of rules about how English words are transformed and uses them to reduce each word towards its root.

Let me qualify here that I never use stemming or lemmatization when I am trying to analyze sentiment in financial documents. That is because research has found that stemming and lemmatization can actually reduce your sentiment accuracy by 10 to 15 percent. Why is that? Think of an example like "promise" and "promising". They have two different meanings and really different contexts: "I promised you something" and "something is very promising" both have "promise" as the root word, but the meanings are entirely different, so you do not want to reduce "promising" to "promise". That is the reason I never use lemmatization or stemming. If you want to use them, go ahead; it will not change your accuracy by much, but my preference is not to use either. I have explained them because no NLP lecture can be complete without covering these steps.

Stop words, however, are the most important step and the one I use the most. NLTK has a list of stop words in English, the connector words; let's see which words those are. The list has 179 words, and if you go through them, pretty much all of them make sense: they are connector words that do not really have much import for the sentiment but just bind the bigger words together. We store the stop words in a variable, and then all I do is run a loop that passes every single word of our tokens list and keeps it only if it is not in the stop words list. Note that I re-ran the tokenization first, because I want the original tokens rather than the stemmed version. If I run the loop, it has trimmed the vector by almost half, and you can see that the words that remain are the most important ones.

So this is a summary of the basic steps performed to vectorize a text document. Once we have a vectorized representation of the text, we can move on to the subsequent steps, which are either lexicon based analysis or machine learning based analysis. Thanks a lot.
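To make the steps above concrete, here is a minimal sketch of the tokenize, stem, and stop-word-removal workflow described in this lecture, using NLTK. The nltk.download calls for "punkt" and "stopwords" are an assumption about what a fresh NLTK install needs; skip them if those resources are already present.

```python
# pip install nltk
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

nltk.download("punkt")       # tokenizer models (assumed needed on a fresh install)
nltk.download("stopwords")   # the English stop word list discussed above

text = "I am not a sentimental person but I believe in the utility of sentiment analysis"

tokens = word_tokenize(text.lower())          # tokenization, lowercased for stemming
ps = PorterStemmer()
stems = [ps.stem(t) for t in tokens]          # rule-based stemming, e.g. believe -> believ

stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t not in stop_words]   # drop connector words

print(stems)
print(filtered)
```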
4. Lexicon Based Sentiment Analysis: Hello, friends. The lexicon based approach is quite popular, and most of the sentiment analyzers you will come across are based on some form of lexicon. This approach relies on an underlying sentiment lexicon; think of it as a dictionary. The lexicon is a list of words, often running into the thousands, which are labeled according to their semantic orientation, or polarity. Typically the words are not labeled simply "positive" or "negative"; instead they have a numeric score that decides whether they are positive or negative, and the numeric range usually lies between -1 and 1. So, for example, a word like "great" or "excellent" will have a score pretty close to +1, whereas a word like "terrible", "dejected", or "depressed" will have a score close to -1. That is the image I want you to have in mind when you think of a sentiment lexicon, and I will show you some actual lexicons in subsequent lecture videos: one on GitHub for VADER, and one for TextBlob as well. As you may appreciate, manually creating and validating such a lexicon is not easy; it is a fairly time-consuming task, and that is the reason we do not have a lot of sentiment lexicons. There are just a handful of popular lexicons out there, and most analyzers use one of them or a combination of them; examples include SentiWordNet, SenticNet, and VADER, which we will discuss in the subsequent lectures. TextBlob, another Python library we will discuss, uses WordNet. So that is what sentiment lexicons are and how they are used.
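To make the dictionary image concrete, here is a tiny illustrative sketch. The lexicon below is entirely made up for demonstration (real lexicons such as VADER's contain thousands of human-rated entries), and the averaging rule simply mirrors the lookup-and-aggregate idea described above.

```python
# A toy, made-up lexicon; real ones (VADER, SentiWordNet, ...) hold thousands of scored entries.
lexicon = {"great": 0.9, "excellent": 0.95, "good": 0.6,
           "terrible": -0.9, "depressed": -0.8, "boring": -0.5}

def lexicon_score(sentence):
    """Average the polarities of the sentence's words that appear in the lexicon."""
    hits = [lexicon[w] for w in sentence.lower().split() if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

print(lexicon_score("The plot was terrible but the acting was good"))   # (-0.9 + 0.6) / 2 = -0.15
```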
There are some very prominent drawbacks to this lexicon based approach, as you may have figured out already. The first is that these lexicons tend to suffer from an inability to process acronyms, initialisms, emoticons, slang, and so on, and these are things that are part and parcel of our everyday conversation; we see them very prominently in everyday language, and a plain lexicon is unable to process this kind of colloquial language. The second drawback of most lexicons is that they are unable to account for sentiment intensity. For example, "good" and "exceptional" will have very similar scores, because most lexicons do not really factor in the intensity of the words. The third drawback is that lexicon based approaches are unable to process sarcasm. I would argue that a lot of humans are unable to process sarcasm either; processing sarcasm requires a pretty high level of intelligence, and we cannot really expect a lexicon based approach to do even a reasonably good job of understanding it. These are some of the drawbacks you will have to live with if you are using a lexicon based approach.

VADER is a fairly recently developed lexicon from the Georgia Tech computer science department, and it addresses the first two drawbacks we just discussed. VADER incorporates popular slang such as "LOL" and "ROFL", and it also incorporates emoticons; when I show you the actual lexicon that VADER uses, you will see that these emoticons and slang terms have been labeled accordingly. The second feature it has, which other lexicons do not, is a wider spectrum: it uses a scale of -4 to +4, and this wider spectrum gives us a higher degree of resolution to factor in sentiment intensity. These two features take care of the two drawbacks we discussed, and as a result VADER has been very successful on social media data, which is not surprising, because social media is beset with colloquial language and emoticons. I am going to share the academic paper that accompanied the release of VADER; if you search for "VADER sentiment analysis" you should see the paper on the first page of results, a PDF hosted by Georgia Tech. Unlike many academic papers it is very easy to read, and it takes you through the methodology they used to build the lexicon and shows how it performed better than the other lexicons they used in their analysis, so I encourage you to go through it. In the next lecture video I'll take you to VADER's GitHub page, go through the code to understand how the calculation actually occurs, and then run some sample sentences through it in a Python script to see how VADER performs.

5. VADER Introduction: Hello, friends. Let's look at the GitHub page of the VADER sentiment analyzer in Python. Just search for "vader sentiment github" and the first result should be the page you are looking for; this is the very popular VADER sentiment GitHub page. Installation is straightforward: pip install vaderSentiment. Let me show you what is important here. If you go into the vaderSentiment folder, you will see vader_lexicon.txt; this is the actual lexicon that VADER uses. The paper I showed in the previous video tells you how this data was collected: they ran a survey through Amazon, which I believe has a survey facility, in which random words were flashed in front of respondents, and ten responses were collected for each word; whatever each respondent answered was saved. So, for example, if you look at a word like "astonished", ten individuals gave ratings to that particular word, and I would remind you that the ratings range from -4 to +4. VADER then takes all these values and calculates the average, and that average is the polarity score. You also see another number, which is the standard deviation; I have not really seen the standard deviation used anywhere, and I went through the code, so I do not know what purpose it serves, but what you should be interested in for this course is the average of all ten responses for each word. A word like "beloved", as you would expect, has a high polarity. This lexicon looks very similar to the image you should already have in your mind from the previous lecture. There is another lexicon file for emojis, where the emoji strings have been assigned polarities as well. So that is the lexicon of this sentiment analyzer.

Let me show you the actual code as well, in vaderSentiment.py. Do go through it; it is fairly straightforward and easy to understand, but let me just show you the important part. You can see that the compound score is computed by summing the valence scores of each word found in the lexicon. Say you have a sentence, you have vectorized it, and you have come up with five or six words in that vector; the analyzer looks up the individual polarity score of every single word and takes the sum. The last step is normalization: after normalization, the compound score is brought into a range of -1 to 1. The function that does this takes the compound score and divides it by the square root of the score squared plus a hyperparameter called alpha. For example, if the raw compound score is 3, the normalized score will be 3 divided by the square root of 9 plus 15; the square root of 24 is pretty close to 5, so with a raw score of 3 you get a normalized score of around 0.6. That is how they normalize the score and bring it between -1 and 1.
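As a rough sketch of that normalization step, assuming alpha is 15 (the value implied by the 3 / sqrt(9 + 15) example above):

```python
import math

def normalize(score, alpha=15):
    """Map the raw valence sum onto (-1, 1): score / sqrt(score^2 + alpha)."""
    return score / math.sqrt(score * score + alpha)

print(normalize(3))   # ~0.61, the 3 / sqrt(9 + 15) example from the lecture
```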
Why do they normalize it this way? I think it is a fairly good approximation for retaining the sentiment intensity while bringing the score onto the -1 to 1 scale. Most sentiment analyzers use this band, and if you did not know the methodology behind VADER and that its raw scale runs from -4 to +4, you would not easily be able to interpret a raw score; you would not know whether the range is -7 to 7 or -9 to 9. Normalizing just makes it consistent with other sentiment analyzers.

At this point let me take you to a very simple example script. Installing is simple: run pip install vaderSentiment, prefixed with an exclamation mark if you are installing from inside your IDE. We import the relevant module, which is called SentimentIntensityAnalyzer; it sits within the vaderSentiment module of the vaderSentiment library, so there are a lot of repeated names here. Once you have imported it, you initialize a SentimentIntensityAnalyzer object, and then we can try some sentences and see what kind of results we get. For example, if I take the sentence "This is a good course" and pass it to the analyzer using its polarity_scores method (you can look at that function on the GitHub page to see how it is calculated, though we already know), it gives you four values: a negative score, a neutral score, a positive score, and a compound score. The compound score is what you are interested in; it takes into account the positive, negative, and neutral words. The code identifies all the positive, negative, and neutral words in the sentence and produces a normalized score for each cluster as well, but that is really overkill; all you need is the compound score. "This is a good course" gets a compound score of about 0.44, which means positive. If I change the word from "good" to "awesome", you see an almost 50 percent jump in the compound score, so the degree modification is clearly working for VADER. Let's also try "This instructor is so cool", which I am; that gets a score of 0.4572. Now, in English, an exclamation mark means you are putting more emphasis on a sentence, so if we put an exclamation mark after "cool", the compound score increases. If you use all caps, meaning you are yelling or very excited, this translates into an even higher score: all caps signals excitement or strong intensity towards that particular emotion, so the sentiment score increases further still.
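Here is a minimal sketch of the usage just described, together with the +/-0.05 thresholds that, as discussed a little later in this lecture, the VADER README recommends. The exact compound values you get may differ slightly across library versions.

```python
# pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def label(text, pos_th=0.05, neg_th=-0.05):
    """Return the compound score plus a positive/negative/neutral label."""
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= pos_th:
        return compound, "positive"
    if compound <= neg_th:
        return compound, "negative"
    return compound, "neutral"

for sentence in ["This is a good course",
                 "This is an awesome course",
                 "This movie sucks"]:
    print(sentence, "->", label(sentence))
```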
Now let's look at some emoticons and slang. If I use a happy-face emoticon in "machine learning makes me happy", it gives a positive compound score. If I use slang such as "ROFL", which implies a happy sentiment, it also gives a positive compound score. And if I use negative slang, as in "This movie sucks", it gives a negative compound score. This is just an example of how easy it is to use VADER for sentiment analysis. It is not without its drawbacks, and they are significant; we will see them when I use web scraping to collect finance articles and run VADER over them to get sentiment scores. The accuracy is not that great, and I want to tell you that at the outset. One thing I missed: if you go to the main page of the GitHub repository, the makers recommend that if the compound score is greater than 0.05 you can consider the overall sentiment of that sentence or document positive, if it is less than -0.05 you can consider it negative, and if it is in the range of -0.05 to +0.05 you can say the overall sentiment is neutral. These are the recommended thresholds, but you can change them based on your requirements. Later I will use a more complicated, more sophisticated Python script that scrapes finance articles, runs them through the VADER sentiment analyzer, and checks whether the accuracy is good or bad. In the next video we will look at another lexicon based sentiment analyzer for which we have a Python library, and that is TextBlob. So I'll see you in the next video. Thanks a lot.

6. Textblob Introduction: Hello, friends. In this lecture I'll introduce you to the TextBlob Python library, which is again a very popular library for sentiment analysis, so let's explore it. Once you search for "textblob github", the first result should be the GitHub page you are interested in, and as you can see it is a fairly popular repository. The summary in the README shows how to use it, and it is extremely convenient: all you have to do is import the module from the library and pass in the text you want to analyze. It really is as simple as that, and that is the reason it has gained so much popularity. Installing is also straightforward: just pip install textblob, nothing fancy.

Let me show you the lexicon this library uses; it is the file en-sentiment.xml. If you look at this XML file, it contains all the words that are part of the lexicon, around 3,000 of them, and you can see that each word has a sense (definition), a polarity score, something called subjectivity, an intensity, and so on. All of these words are taken from WordNet, so unlike VADER, this library does not have its own lexicon; it uses a pre-built WordNet-based lexicon, which is quite popular for sentiment analysis. That is a very important distinction between VADER and TextBlob. Something else interesting about how this library is built is that there are multiple instances of the same word. Why? Because the same word, for example "clear", can be used in different contexts, and the contexts are defined sense by sense: one entry for "clear" means allowing light to pass through, and a second means readily apparent to the mind. These are two different usages of "clear", and each context has been assigned its own polarity score. The polarity score for "clear" is 0.1, which makes sense because it is neither very positive nor very negative. Then there is subjectivity which, as you may have guessed, captures the idea of a word reflecting somebody's judgment rather than fact; objective means factual, subjective means shaped by someone's judgment. I believe it ranges from 0 to 1, and you want the subjectivity score to be low, because a purely factual statement is not someone's judgment. Then there is something called intensity. Intensity is a parameter used for modifier words, for example words like "very" or "not", words that modify the intensity of another word; that effect is captured by this parameter. So, like VADER, TextBlob also does some factoring of intensity, which again makes it popular. There is also a confidence field, whose exact meaning I do not know. It is a good idea to understand what this library is built on.

Now let me quickly show you how to use it in Python. Installing is straightforward: pip install textblob. You import the module called TextBlob, with a capital T and a capital B; this is the module for the lexicon based approach. Let me tell you right now that TextBlob also has a machine learning based approach: you can import a classifier module that uses a pre-trained machine learning classifier to classify the sentiment of a statement, and I will briefly touch on it when I discuss the machine learning approach. Let me also say that if you are trying to use TextBlob to analyze the sentiment of financial text, I would recommend not using the classifier, because the training set that the TextBlob classifier uses is based on movie reviews. Once you have imported the module, all you need to do is pass in the text, and .sentiment gives you the sentiment. If you go through the code on GitHub, it does pretty much everything we have discussed: it vectorizes the sentence, performs root-word reduction and stemming, everything we covered in the earlier lecture video, and once it has vectorized the sentence it performs the calculation I'll show you now. Take the sentence "His remarkable work ethic impressed me". "His" will obviously have zero polarity; "remarkable" should have a high polarity, and indeed it is 0.75, which makes sense; "work" and "ethic" should have zero polarity; "impressed" should have a good polarity, close to 1 actually, a very good word; and "me" has zero polarity as well. So in this sentence there are only two words that are of importance, and the overall sentiment is nothing but the average of the non-zero polarities: in this case we had 1 and 0.75, and the average of those is 0.875. So calculating the polarity is as simple as taking the individual polarities of the words and averaging them. In a nutshell, that is how TextBlob works; it is popular because it is extremely convenient, as you can see from this brief code snippet. In the next lecture video we will use web scraping to get financial news text to analyze.
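A minimal sketch of the lexicon-based TextBlob usage described above. The example sentence is the one from the lecture, and the 0.875 figure assumes the word polarities quoted there; the lexicon shipped with your installed version may give slightly different numbers.

```python
# pip install textblob
from textblob import TextBlob

blob = TextBlob("His remarkable work ethic impressed me")

# .sentiment returns (polarity, subjectivity); polarity is the average of the
# non-zero word polarities, e.g. (0.75 + 1.0) / 2 = 0.875 for the scores quoted above.
print(blob.sentiment.polarity, blob.sentiment.subjectivity)
```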
7. Building a Sentiment Analyzer using VADER - Part I: Hello, friends. In this lecture video we will scrape the website of a crude oil news aggregator and then use the VADER sentiment analyzer to analyze the sentiment of the news articles we have scraped. Once again, VADER is a lexicon based sentiment analyzer, so you can already expect that the accuracy is not going to be that good, because when we analyze finance news the polarity of certain words can be very different from how they are used in normal language. For example, words like "bullish" and "bearish" are very finance-specific terms; a trader would construe their meaning very differently from a person who is not involved in financial markets. Likewise, if I tell you that markets ended in red, a person in finance will interpret that sentence quite differently from someone who does not follow markets. So using a lexicon based sentiment analyzer here can be a bit tricky, and I want to use this example to show you the pitfalls of using a lexicon based analyzer on financial news data; you would be surprised how many people use a lexicon based approach to analyze finance news or finance text. We will just see what the accuracy looks like and how it performs.

The website we will be using is oilprice.com. It is an oil-related news aggregator which also has prices, and people who are into crude oil or trade it use this website a lot. Go to Energy and then Crude Oil; this part of the website has the relevant crude oil news that you can go through. It is very similar to the Yahoo Finance news aggregation: on Yahoo Finance, if you search for a particular ticker, for example MSFT, it gives you a summary of the stock, and if you scroll down you can see all the relevant news pertaining to Microsoft. Something similar is happening here. You may want to go back and visit the web scraping section of this course, where I explained web scraping in quite a detailed way, but I will use this lecture to revise our web scraping knowledge, so it is a good opportunity to revisit those skills.

Let me first see whether we can get multiple pages of news. We can: page 2, page 3, and so on, and the only difference is that the page number is appended to the URL, so the URL changes based on the page number suffix. (The video that keeps popping up on this site is really annoying.) Now let's inspect the page. As you can see, the article URL is hidden within the class "categoryArticle", and the tag we want is the a tag with its href. So this should not be too difficult. First we import requests and BeautifulSoup, the two most important libraries if you are scraping in Python, and of course the VADER sentiment analyzer. I start out with some empty lists. What are they for? In one list I am going to store all the URLs I have scraped; in the second, the date and time of publication of every news article; then the actual text body of every article; and then the headline. I create these four lists because I will store all this information separately and later merge them to create a DataFrame.

The first step is to make a connection with the website. I use a loop so that I can go to multiple pages; if you want to scrape the first ten pages of oilprice.com, you just change the range to 1 to 11, or whatever you need. Let's do the first three pages, so that we have a good number of articles to make sure our analyzer is working. The URL is the same one we saw on the site; the only change is to make the page suffix dynamic, which I do with Python's format function, so for i equal to 1 the URL ends with Page-1.html, and so on. The next step is requests.get on that URL, which, as you all know, simply establishes a connection with that particular page; if I run it you see a 200 response, which means the connection has been established with no problem. Now that the connection is established, we use BeautifulSoup with the html.parser to get the body of the page in HTML format; if I run this you can see the entire HTML text of this particular page. Once that is done, all we need to do is use BeautifulSoup's modules to parse whatever data we are interested in. In this case we have already decided that "categoryArticle" is the class we are interested in, and the href we want is hidden within the a tag of that div class. So I run a loop: on the soup object, which holds the entire page body, I use the find_all module and say, in effect, save everything within the div tag of this particular class. Once we have that, we can simply use the get method to pull out all the hrefs in that block. The only problem is that some of these categoryArticle blocks contain duplicated hrefs, so if you simply use get on everything you will end up with duplicate URLs. To get around that, I use a simple piece of logic: if the href we have found already exists in our URL list, do nothing; only if it is not there do we append it to the URL list. It looks a bit hairy, but all I am doing is making sure that only unique URLs get saved. It takes some time, around 30 to 40 seconds per page, and I have already run it, so let me show you the URL list: I ended up with around 20 URLs scraped from these three pages. I'll stop the video now and let you revisit what we have discussed so far; we will cover the second half of this code in the next video. Thanks a lot.
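Here is a hedged sketch of the Part I flow described above. The page-URL pattern and the "categoryArticle" class come from the lecture, but treat them as assumptions: the site's layout can change, so verify the selectors against the live page before relying on them.

```python
# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

base_url = "https://oilprice.com/Energy/Crude-Oil/Page-{}.html"   # page pattern as described; verify against the site
urls = []

for i in range(1, 4):                                   # first three pages, as in the lecture
    page = requests.get(base_url.format(i))
    soup = BeautifulSoup(page.text, "html.parser")
    for div in soup.find_all("div", {"class": "categoryArticle"}):
        for link in div.find_all("a"):
            href = link.get("href")
            if href and href not in urls:               # keep only unique article URLs
                urls.append(href)

print(len(urls), urls[:3])
```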
If you look at any of these links, you'll see that the URL itself gives you the headline: the part after the last forward slash is the headline, "Why Saudi Arabia Split Up Its Energy Ministry" and so on. The only problem is that it's separated by dashes, so we'll have to manipulate it and remove those dashes. That's fairly straightforward. While I'm looping through this URL list, let me show you each step of the code, so I'll populate the variables one at a time. Let me pass the first value to one of the variables; so now it holds the first link that we extracted, "UK offshore oil gas is about to boom again". Now we start the loop, and every single link will go through all the steps I mentioned. The first thing I'm doing here is extracting the headline. Like I said, that's fairly straightforward: all I'm doing is splitting this particular string by the forward slash, and the last part of the split string is the headline section. Once I have that, all I need to do is replace the dashes with blank spaces. Let me show you what happens when I just split it: you get a list with the string broken at every forward slash, and I want the last one, so index minus one corresponds to the last element of the list, which is this part. Once I get that, all I have to do is use the replace method and replace dashes with blank spaces. Let me run this entire thing; this is the headline. There's still a ".html" at the end, which is a bit of a nuisance, and I'd have to write one or two more lines of code to tidy that up as well, but since I'm not using the headline anywhere else (I'm not using the headline for the sentiment analysis, I'm using the text of the news), I haven't really bothered to take care of that ".html" string. So we have extracted the headline, and this list will be populated when the entire loop runs. Next is the date and time of publication of the news. Again, we'll have to establish a connection, requests.get on that particular URL, the page of that particular article, and then create the BeautifulSoup object, which will pull the entire page and parse it using html.parser. I'm not running this; it's pretty straightforward, very similar to what we did before. Now, how am I getting the date and time? For that, let's go back to the actual page and click on an article. If you click here, you'll see the person who published this, and then the date and time. Where is this hidden in the page? If I open the inspector, you'll see it's hidden under the class "article_byline": in article_byline you see the text "By", then the link, then the person or institution who wrote this particular article, followed by the date, and the text is separated by a dash. We want everything after the dash as the date and time of publication, and that's exactly what I'm doing here: once you create the soup object, you loop through every element found for the span tag with the class article_byline.
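Putting those per-link steps into a sketch (the byline parsing itself is finished in the next step):

for link in url_list:
    # headline: last URL segment, dashes turned into spaces (the ".html" is left in, as noted)
    headline = link.split("/")[-1].replace("-", " ")
    headlines.append(headline)

    # fetch the article page itself for the date/time and the body
    article_page = requests.get(link)
    article_soup = BeautifulSoup(article_page.content, "html.parser")
    # the date/time comes from the span with class "article_byline", parsed below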
So this part of the code loops through every single element under the span tag for the class article_byline, and that gets you this entire block. Now all I'm doing is extracting the text: .text gives you every text portion within this particular tag and this particular class. Let me just run it for one article so you can see what happens. We already have the soup object for that page, the UK oil article whose link we just talked about; this is its entire HTML. Before running the whole loop, let me just print the text. If I run this, I can see all the text for this particular span and this particular class, and you can see you get the "By" text with the contributor, followed by the date, something like September 03, 2019. So all you need to do now is split it using the dash delimiter and take index minus one, which is the last part of the split string. That's what we're doing here. Once we do that, we extract the date and time for every single URL. I've already done that, so let me show you how the date-time list looks. This is the date and time for every single news article; in the last one something odd has happened, so I'll have to investigate, but in some of the URLs the format may be slightly different, they may have an extra delimiter or something, so sometimes you have to be careful. This is how it looks for now. Now for the last part, which is getting the actual body. How do we get the actual body? Let's go to the page again and click on Inspect. If you know HTML, you know that the p tag corresponds to a paragraph: every time you see a p tag, there's paragraph text embedded there. So you'll have to get all the p tags, parse them, and aggregate every single one. Even the surrounding bits are embedded as paragraphs in the page, so anything that's text will also show up as a paragraph. That means I'm extracting a lot more information than I require, but I'll show you how to manipulate it and keep only the news body I'm interested in. So let's see how we go about doing that. We're in the loop again, and in the soup, the BeautifulSoup object with the entire HTML page, we find_all paragraphs and append them to a temporary list. This temporary list will have more information than we actually require. Let me run this and show you how temp looks. If I run this, you see you get the entire text from that particular page, and there's a lot of material we don't require. So how do we know what exactly we need? Is there anything that tells us where exactly to split to get the actual body? If you look at the text, you'll see a contributor blurb near the top describing who has contributed this piece, and so on, before the article itself.
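A sketch of these two steps, the byline date and the raw paragraph collection, continuing inside the same per-link loop:

    # date/time: everything after the dash in the "article_byline" span
    for span in article_soup.find_all("span", {"class": "article_byline"}):
        date_times.append(span.text.split("-")[-1].strip())

    # raw body: collect the text of every <p> tag; this grabs more than we need
    temp = []
    for p in article_soup.find_all("p"):
        temp.append(p.text)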
Who is the publisher, who has contributed this, and so on. So the actual news starts right after "More Info": "As tens of thousands of oil and gas professionals gathered in Aberdeen...", so "More Info" is where the news is actually starting. And where is it ending? It's ending at "By" whoever has published it. So if we come across a string that starts with "By" followed by a name, that's a good indicator that the text body has ended, and the start is at "More Info". Everything that sits between these two strings is my actual news. So how do I go about extracting only that portion? I do a bit of manipulation here; it's not that difficult. What I've done is loop through the entire temp list. If I open temp, it's a list, an iterable Python object, so I loop through it, and I loop in reverse: I've used the reversed() function, so the iterator starts from the last element. As you go through this list, if you encounter a string that starts with "By" and ends with "oilprice.com" (which is the website's name), that's the line I want: "By" the contributor, for oilprice.com. So I'm saying that if a sentence starts with "By" and ends with "oilprice.com", that is my last sentence. I'm trying to identify the last sentence of the article, and that's what this does. But the problem with this logic was that in some instances the contributor was not from oilprice.com but from some other crude oil news aggregator, so I was missing a couple of news articles. So I'm saying: only if you are unable to find a line that both starts with "By" and ends with "oilprice.com", then I'm fine with taking, as the last sentence, any string that starts with "By". And remember, we're scanning from the bottom, so there's a high chance you'll encounter the intended string, the publisher byline, first. That matters because a line can easily start with "By" in English; if you scanned from the top, the chances are you'd pick an incorrect last sentence. That's why I'm looping from bottom to top. Once I do this, I get the last sentence, and once I have the last sentence, all I need to do is use the join method. I'm saying: start from the element that comes after "More Info" (you know what the index function does in Python, it gives you the index of a particular element in a list), so I'm joining from the element right after "More Info", element 16 in this example, and going all the way to the last sentence. Then join all of that; in this particular example it joins from element 16 through element 26, and that is what will end up in my body.
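Here is a sketch of that trimming logic (the marker strings are the ones described above; the exact page text can vary, so treat this as illustrative, still inside the per-link loop):

    # find the article's last sentence, scanning from the bottom of the page text
    last_sentence = None
    for para in reversed(temp):
        if para.startswith("By") and para.endswith("oilprice.com"):
            last_sentence = para
            break
    if last_sentence is None:                  # fallback: any paragraph starting with "By"
        for para in reversed(temp):
            if para.startswith("By"):
                last_sentence = para
                break

    # keep everything between the "More Info" marker and the last sentence
    if last_sentence is not None and "More Info" in temp:
        start = temp.index("More Info") + 1
        end = temp.index(last_sentence) + 1
        news_bodies.append(" ".join(temp[start:end]))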
So the joined text gives you the actual body. Let me show you how it looks: this is the actual body I've extracted from the news article, and it looks pretty reasonable; you can go through it. What I'm doing here is not that tricky: all I'm doing is identifying the first line of the text body and the last line of the text body and extracting only that part from the page. So now I have everything: I've populated the URL list, the date and time, the news text, and the headlines. All I'm doing now is putting this into a DataFrame. Let me show you how my DataFrame looks. This is the DataFrame: it has the date, it has the headline, it has the news, and I've also added the sentiment score, which I'll show you in a moment. So this is 90% of the work. Calculating the sentiment is the last step; it's just a couple of lines of code if you're using a lexicon-based approach. The main effort was the scraping and getting the data into the required format. Once I have this DataFrame, all I need to do is initiate an analyzer object, SentimentIntensityAnalyzer; we all know that from the previous lecture video. Then I've defined a function called compound_score. Remember, VADER gives you all the scores: the positive score, the neutral score, the negative score, and the compound score. All I need is the compound score, so I've written a one-line function that returns only the compound polarity score for a given text. Once that's there, all I need to do is apply this function to the News column of the DataFrame. This is the News column, which has the actual body of the news, and I apply the function to this column of the DataFrame to get the sentiment scores. That's how it's done. But when I went through the sentiment scores, let me tell you, they were not that good: around 65 to 70% accuracy, nothing to brag about. Why were there so many errors? The same thing we discussed in the previous videos: financial articles have a certain context that defines whether they're positive or negative, and most of these lexicon-based dictionaries are not equipped to handle that. For example, what does this one say? "UK offshore oil gas is about to boom again." That is actually terrible news if you're a crude oil trader, because if offshore oil and gas is about to boom, supply is going to go up. Right now it's September 2019, and we all know the crude oil market is glutted: there is a lot of supply, a lot of surplus, which is driving down prices. So if you're a crude oil trader you don't like this news, because it tells you more supply is going to hit the market; it should actually have been given a negative score. That said, things are not all bad: some of the news articles are fairly straightforward and their sentiment score is pretty good. "Why Saudi Arabia Split Up Its Energy Ministry" got a positive score; I don't know, it may simply have found a lot of positive words in the text, and that's why it gave a high score. Just go through it and you'll see.
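To recap that scoring step, here's a minimal sketch (the column names and the particular VADER import are assumptions about how the script was put together, and it assumes the four lists line up one to one):

import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
# (the equivalent import from nltk.sentiment.vader also works if you prefer NLTK's copy of VADER)

df = pd.DataFrame({"Date": date_times, "Headline": headlines, "News": news_bodies})

analyzer = SentimentIntensityAnalyzer()

def compound_score(text):
    return analyzer.polarity_scores(text)["compound"]   # keep only the compound score

df["Sentiment Score"] = df["News"].apply(compound_score)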
I mean, the accuracy is, not 50 or 60%, but more than 60% for sure, and that's mostly because of context: these analyzers are unable to fathom the context of financial news articles. That is the problem. So in the next lecture videos we'll look into the machine learning approach, which is definitely better than this. It has its own pitfalls and logistical challenges, and we'll talk about them, but you'll see that the accuracy score improves significantly if we use a machine-learning-based approach. Thanks for sticking around for this video. Make sure you go through the code again, and do post any questions you may have in the Q&A section. Thanks a lot. 9. Machine Learning Based Sentiment Analysis: Hello, friends. Before we delve into the nitty-gritty of a machine-learning-based sentiment analyzer, I'd like to spend some time talking about machine learning and the naive Bayes algorithm. At this point I would encourage you to brush up your machine learning knowledge; if you don't know anything about machine learning, you may find it a bit difficult to follow along in the subsequent lecture videos of this section. However, picking up the basics of machine learning is not that difficult: there is a wealth of information out there on the Internet, and I think you should use those avenues to build a basic, fundamental understanding of machine learning. Okay, I'm only going to talk about supervised learning, because that's what we're going to use. It's a family of algorithms based on analyzing labeled data, and when I say labeled data I mean you've been given some historical data; in that historical data there are some independent variables and there is a dependent variable. The assumption is that if you know the independent variables, you should be able to predict the outcome of the dependent variable. For example, in this case I have data on Ivy League admissions for various applicants. These are students who applied to various Ivy League universities; these are their profile features, their parameters; and this is the outcome, whether they were successful in the admission or not. This is completely fictitious data that I just made up, so don't read too much into its actual veracity. For example, the first person had an SAT score of 1580 and a GPA of 4, her parents were not alumni of an Ivy League institute, and she actually was able to get admission into an Ivy League institution. Likewise, you have information for seven individuals, and you have the labeled data, the labeled dependent variable, for these students. Based on this data, if we're asked what the probability is that a student who scored more than 1500, has a GPA of more than 3.2, and whose parents are not alumni of one of the Ivy League institutions will get an admission, what would you say? That's the problem we're trying to solve using a machine learning algorithm. To solve this you can use Bayes' rule; again, I want to take you back to your high school probability classes, high school mathematics. Bayes' rule says that a conditional probability can be calculated using this formula. What is conditional probability? Conditional probability is the probability of an event, given that another event has already happened.
For example, the probability of me winning a marathon given that I train twelve hours daily. The plain probability of me winning a marathon is pretty low; I'm a pretty lazy guy, so say it's around 0.3 at best. But if I want the probability of me winning the marathon given that I train twelve hours daily, that number changes. Conditional probability is arguably more important than plain probability, because in the real world you'll see that most applications of probability involve some form of Bayes' rule, of conditional probability. So what does Bayes' rule say? If you want the conditional probability of event y given event X, it can be calculated by this formula: the probability of X given y, times the probability of y, divided by the probability of X. In our case, y is the outcome we're trying to predict (the admission label) and X is the feature vector x1, x2, x3. Don't worry, I'll show you how we're going to apply Bayes' rule to the problem we just discussed on the previous slide. In machine learning we use the naive Bayes algorithm, and naive Bayes is a modification of Bayes' rule; actually I wouldn't even call it a modification, it's just an assumption, a naive assumption, that the features are independent. So this X is the feature vector. I know I'm using a lot of technical terms and you may be finding it a bit overwhelming, so let me explain what capital X is. In this case, if I ask what the probability of getting an Ivy League admission is given that a person's SAT score is above 1500, the GPA is greater than 3.2, and the alumni flag is zero (no alumni parent), then this combination is capital X: it's the feature set we're working with. Remember that. It's a vector, and when I say vector, just think of it as a list. You'll see the term vector used a lot in machine learning research; don't get intimidated, just picture a list of features. It has more than one feature, three in this case. That's the image I want you to keep in mind when you hear the word vector. So the naive assumption is that all these features are independent of each other: the person's SAT score, the person's GPA, and whether the person's parents are alumni are treated as independent events. And you know that's a highly naive assumption, because if a person has a high SAT score, chances are that person's GPA is also pretty high; these are not completely independent events. But we assume that all the features, all the independent variables, are independent of each other. That's the assumption behind naive Bayes. And again, if you recall from your high school probability lectures, if events A and B are independent, then the probability of A and B equals the probability of A times the probability of B. This is an important theorem that we're going to use on the next slide. Therefore, the probability of a person scoring more than 1500 on the SAT, and having a GPA above 3.2, and the person's parents not being alumni, is equal to the product of the individual probabilities of those events.
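As a tiny illustration of Bayes' rule and the independence assumption in code (the numbers below are hypothetical, not the ones from the admissions table):

# Bayes' rule: P(y | X) = P(X | y) * P(y) / P(X)
def bayes(p_x_given_y, p_y, p_x):
    return p_x_given_y * p_y / p_x

# naive assumption: the joint probability of the features factorises,
# P(x1, x2, x3) = P(x1) * P(x2) * P(x3)       (hypothetical values below)
p_x = 0.4 * 0.6 * 0.5
p_x_given_y = 0.5 * 0.7 * 0.6                 # same factorisation, conditioned on y
p_y = 0.5

print(bayes(p_x_given_y, p_y, p_x))           # the posterior P(y | X)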
With this knowledge, we're going to solve the problem we had on the previous slide. Here's the problem: what is the probability of getting into an Ivy League school with these independent variables, this feature set? We have to calculate that. It may look a bit overwhelming, but again, remember that Bayes' rule is your best friend. Using Bayes' rule, you can convert this problem into: the probability of all these features given admission to an Ivy League, times the probability of y (getting into an Ivy League), which is this term here, divided by the probability of these individual features, that is, the probability of scoring more than 1500, having a GPA of more than 3.2, and the parents not being alumni. So now you calculate these three probabilities, plug them into Bayes' rule, and calculate the answer. It's fairly simple. What is the probability of y, the probability of getting into an Ivy League school? Based on our data we have seven rows, and out of those, four people got admitted, so the probability of getting into an Ivy League school is four out of seven. Again, don't read too much into the data, because I've just made it up; in actuality this rate would be far lower, maybe less than 4 or 5%. So that is the probability of y. Now let's calculate the probability of X (we'll calculate the remaining term last). The probability of capital X is nothing but the probability of scoring more than 1500 on the SAT, scoring more than 3.2 in GPA, and the parents not being alumni. We discussed on the previous slide that we're assuming these three are independent of each other, and therefore, to calculate this probability, all we need to do is calculate the individual probabilities and take their product: the probability of the SAT being more than 1500, times the probability of the GPA being more than 3.2, times the probability of the parents not being alumni. Let's calculate those. It's again very simple; all we have to do is count. How many times do you see an SAT score of more than 1500? Twice, so that's two out of seven, which I've put here. What is the probability of the GPA being more than 3.2? Count them: five, so five out of seven. Then the probability of the parents not being alumni; read it off the table the same way, again out of seven, so you can calculate that value as well. Now the interesting part: you have to calculate the probability of X given y, the probability of all these features given that admission to an Ivy League was secured. You have to find how many times the SAT score was more than 1500, and the GPA was more than 3.2, and the parents were not alumni, and the Ivy League admission was secured. If you look at the table, that happens only in this one case, so that is one out of seven. Now that you have these three probabilities, you can very simply plug the numbers into the naive Bayes formula.
And you get the answer, which is 0.93, pretty close to one. Usually, if you got a score of more than 0.5, you would predict that the person will get an admission into an Ivy League school. So this is the math, the basic math, behind a fairly sophisticated machine learning model, naive Bayes, which is used a lot in the industry. And once again I'm reiterating: don't get overawed by machine learning. The math is nothing but high-school-level mathematics. All you need to be good at is probability, coordinate geometry, a little bit of calculus, and also linear algebra; linear algebra is very important. Just basic mathematics is required for machine learning, and you'll be able to understand what's under the hood of a machine learning algorithm very easily. With this knowledge we'll go into the next lecture video, which is about how to use the naive Bayes algorithm, how to apply it to building a sentiment analyzer. 10. ML Feature Matrix & TF-IDF Introduction: Hello, friends. So let's start building a sentiment analyzer using the naive Bayes algorithm, which is a supervised machine learning model. If you recall, for supervised learning models the most important requirement is labeled data, and that's what I have for you here. For simplicity I've taken these four; I'm calling them articles, so consider them four news articles, and I have labeled each as carrying negative or positive news based on my read. "Stocks continue to face headwinds": obviously a negative-sentiment article. "Earnings have been stronger this season": very positive news. "Stock valuations are at record high": this can be positive or negative depending on your perspective; for me it's positive, because I like to be a momentum trader, but for people who are more conservative this would be very negative news. "Earnings of bellwether stocks are weak": obviously a negative-sentiment article. So this is my corpus, and this is my training data. These are labeled news articles, and I want to build a machine learning classifier that will learn from this corpus; then, if I give it a new news article, it should be able to predict whether it has a positive or negative sentiment. That's what we're going to do. Again, the first step in NLP is tokenization; you're all pros at it by now. So I've done basic tokenization, I've also reduced the words to their root form (again, not something you always have to do, but I've done it), and I've taken out the stop words. The first article gets reduced to this particular list; I'm calling it a list, but you can also call it a vector. It has "stock", "continue", "face", and "headwind", and the label is minus one because it's negative. Likewise, I've done this for all four articles. Okay, so that's done. Now the second step: if you remember the illustration on the previous slides about Ivy League admissions, you need a row-and-column representation for the machine learning algorithm to work. We call it a feature matrix, and that's what you need to build. It's pretty simple in this case: we're treating every single unique token as a feature.
In this case I have twelve unique tokens, twelve unique words, in this corpus, and they constitute my features. I'm considering each of them an independent feature of my feature set; you can think of them as independent variables, and then each row has a corresponding dependent variable. So, for example, in the first row of my feature set I'd have all these features as columns, and then you update the corresponding value for every single row. For the first row we only have values for "stock", "continue", "face", and "headwind", so I put 1, 1, 1, 1 for those, and all the others will be zero. That's my first row, and in the first row the label is minus one. Likewise, for the second row we have "earning", "strong", and "season", so those are 1, 1, and 1, and the sentiment is plus one, so we put one there. You do this for every single article, every single document in your corpus, and populate what's called the feature set, or feature matrix. Every single row gets its labeled dependent variable, which you also put here. That's the first thing you need to do. If you have this row-and-column representation, the feature set, then you're ready to apply your machine learning algorithm, let the algorithm learn from this data, and then hopefully have it do a good job of prediction. But before we start applying machine learning models to this dataset, we need to perform one more step, and that's called TF-IDF. TF-IDF stands for term frequency-inverse document frequency, and we use this methodology simply to assign a weighting factor to every feature. What is a weighting factor? If you look at this corpus, you'll see that the word "stock" appears in pretty much every single line, every single article, and that's pretty obvious, because these articles are about the stock market. So the word "stock" does not necessarily have any import on the sentiment; because we're talking about this particular subject, you're expected to see a lot of that word in every article. So we need to penalize it: there's a high likelihood that the most frequently occurring words are not actually influencing the sentiment of the article, they're mostly about the subject we're talking about. For example, if you have thousands of articles about mobile phones, about smartphones, then obviously a word like "smartphone" or "battery" will occur a lot in every single article; that does not necessarily mean that this particular feature, this particular word, has any influence on the sentiment. But if you do not de-weight such words, if you do not penalize high-frequency words, then when you apply naive Bayes or any other machine-learning-based model, the mathematics being what it is, the model will essentially ignore all the other features; I'll come back to that in a second.
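Before we get to TF-IDF, here is a toy sketch of the count-based feature matrix just described, using the four example articles. Note that scikit-learn's default tokenizer and stop-word list will produce a slightly different token set than the hand-worked one (no stemming, for instance), so treat it purely as an illustration:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Stocks continue to face headwinds",
          "Earnings have been stronger this season",
          "Stock valuations are at record high",
          "Earnings of bellwether stocks are weak"]
labels = [-1, 1, 1, -1]                        # the hand-made sentiment labels

cv = CountVectorizer(stop_words="english")
X = cv.fit_transform(corpus)                   # 4 x N matrix of token counts

print(cv.get_feature_names_out())              # the tokens acting as features
                                               # (get_feature_names() on older scikit-learn)
print(X.todense())                             # the row-and-column feature matrix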
What happens without that penalty is that the model ends up with an almost one-to-one mapping between the label and that highly occurring, high-frequency feature. For example, if the word "stock" occurred far more often in a particular article than every other word, which occurred only two or three times, and likewise for all the other articles, then once you apply your mathematical model you'll see that all the other features become essentially redundant and you get pretty much a one-to-one mapping between that high-frequency feature and the labeling. It may be a bit abstract right now, but as we practice more you'll see that you need to penalize high-frequency words, otherwise they will completely hijack your entire algorithm. And that is where TF-IDF comes in. TF-IDF is a mathematical construct by which we re-weight every value based on its TF-IDF score: the higher the frequency of a given word, the more we penalize it. So what is TF-IDF? TF is calculated as the ratio of the number of times the term appears in the document divided by the total number of words in the document. For example, in the first article "stock" appears once, and there are four keywords in that particular article, so the TF value for "stock" is one quarter. And what is IDF? IDF is the natural log (log to the base e) of the total number of documents divided by the number of documents containing the term. In this case the total number of documents is four, and the number of articles in which "stock" appears is three, so the IDF is the natural log of four thirds. If "stock" appeared in all four articles, the IDF would have been the natural log of four over four, that is, the log of one, which is zero. This is pretty intuitive: if a word appears in every single article, you're effectively reducing its value to zero, because its IDF is zero; there's a high likelihood that such a word is not a sentiment-influencing word but a subject-oriented word. So that's what we've done: if we apply TF-IDF, the value for "stock" gets reduced from 1 to roughly 0.7 here, and that is our revised feature set. So you use TF-IDF to improve the feature set before feeding it into the ML model. I've already discussed why: if you don't penalize high-frequency words, chances are they'll completely hijack your model, and that's what we're guarding against here. In the next videos I'll show you Python code where we use labeled training data to build a model and then predict the sentiment of new articles we encounter. So over to the next video. Thanks a lot. 11. Building a ML Based Sentiment Analyzer - Part I: Hello friends. Let's build a machine-learning-based sentiment analyzer using Python. The first step towards building this analyzer is to extract data, and I'm using the same script I discussed a few lecture videos back, the same web-scraping script, to get news articles from the oilprice.com website. I'm not going into it again, because I already spent quite some time going through that entire code.
I'm saving all this data in a CSV file called crude_oil_news_articles.csv. So I do the web scraping and I dump all this data into a CSV file. Let me show you how the CSV file looks. This is it; I'm keeping only the headline and the body of each news article. Now, the most important thing in the data part of the process is to label this data, and you have the difficult job of going through the news articles and labeling them. You'll have to create a new column called Sentiment and then go through these articles and label them. This seems like a really frustrating task, but believe me, it is the single most important task in building a machine-learning-based sentiment analyzer, and you'll see why I'm saying this: the code itself is quite simple. I'll show you in the subsequent videos that the code is hardly 60 or 70 lines; it's this task that is the most important and the most crucial. If you have garbage training data, your model will spit out garbage; it's the garbage-in, garbage-out principle at work here. You have to make sure the training data is good, because if the training data is not good, the model is not going to be good. So you'll have to spend some time labeling this. I have labeled it, so let me show you the labeled data. I went through all the news articles and labeled them using my knowledge of crude oil economics. Any time there was a supply disruption, that was positive news, because as an oil trader you want the price to go up. When there is oversupply, that's negative news. When the economy is picking up, when there's news about developing economies buying more oil to turbocharge their growth, that's positive news. So this labeling will be based on how good or how familiar you are with that particular sector. Once labeling is done, we'll implement the things we've talked about: vectorization, applying the naive Bayes model, and then predicting. I'll show you that in the next videos. 12. Building a ML Based Sentiment Analyzer - Part II: Hello, friends. Now that the data is ready and labeled, we can use it to train our machine learning model. This is the script where we train the model, and you can see it's a pretty small script, because most of the heavy lifting we talked about in the previous videos, building the matrix, applying the naive Bayes theorem, calculating the probabilities and so on, is already done inside scikit-learn's modules. So the actual machine learning model's code is not that important; it's the other things, the data processing, understanding what parameters to use and so on, that matter more if you're a practitioner. So let's see how to go about it. What libraries am I importing? From the sklearn.feature_extraction.text module I'm importing CountVectorizer. This CountVectorizer will do your tokenization, remove the stop words, and so on; this one class will handle all the basic NLP steps we talked about.
The second thing I'm importing is TfidfTransformer, and as you may have figured out, this is the class that does all your TF-IDF calculations. All these things are already there, so all you have to do is import them and use them; it is that easy. But that doesn't mean machine learning modeling is a joke and you don't have to do anything: as we've discussed, there are a lot of things you have to be very mindful of, and that's what I keep emphasizing in these lecture videos. Then, to implement naive Bayes, we'll be using Gaussian naive Bayes. In scikit-learn's naive Bayes module there are several variants you can import; I'm importing GaussianNB because my data has more than two classes. Let me show you: if you look at the labeling I've done, my labels are negative, positive, and neutral, so there are more than two classifications. If you have more than two classes, you can go for Gaussian naive Bayes; with only two classes it's often recommended to go with multinomial naive Bayes. Again, that's something I picked up from discussions in forums where practitioners talk about these things, and that's why I've used it; you can use multinomial naive Bayes as well, but I found the accuracy was not as good. For the purposes of this lecture video, I'm using Gaussian naive Bayes. Pandas: it's obvious why we're using that. And then there's a very interesting library I'm importing here called pickle; I'll tell you what pickle does as we go through this code. I'm pretty sure most of you already know what the pickle library does, but I'll discuss what it's doing in this case. So the first step is loading the data. I've loaded the data, and now we extract the column with the news article bodies. The second column, if you recall, was the news body and the first column was the headline; we're interested in the news body, and that's what we've got here. You can see that I've used the encoding ISO-8859-1 here. This encoding lets you process tricky characters; you'll sometimes see accented or other non-ASCII characters in scraped content, and if you don't set an encoding you'll get an error when such characters appear. Just be mindful of that: you may need a different encoding depending on the kind of articles you're scraping. So that's taken care of, and I've extracted the news article bodies into a variable called capital X. Now the fun starts. I instantiate an object called vectorizer, which is an object of the CountVectorizer class that we imported from the sklearn.feature_extraction.text module. Let me show you the documentation for this class: if you search for "sklearn CountVectorizer", it should be the first result you get on Google, and that's the documentation page for it.
Like other popular Python libraries, scikit-learn's documentation is pretty solid, and I'd recommend going through it if you're interested. So what does this CountVectorizer do? It converts a collection of text documents to a matrix of token counts, and because you paid attention in the previous lecture videos, you know what that means: it will take the bodies of these news articles and convert them into a feature matrix. That's what it's doing here. So I create this object of the CountVectorizer class with stop_words set to 'english'. As for the tokenization, the NLP steps it will perform: it will tokenize your news article and then eliminate the stop words; it already has a built-in repository of English stop words (I believe similar to the NLTK stop-word list we saw in the previous videos). Then, in the next step, we use the fit_transform method of the vectorizer object and pass in the news article bodies we extracted, storing the result in a new variable, the vectorized X. If I show you what that variable contains, you'll see entries like (0, 2456) with a count next to them; everything starts from row 0 and the row index runs up to 65. You may have guessed what these numbers mean: the first number of each tuple corresponds to the document number. There are 66 documents in this corpus we're working with, which is consistent: there are 66 news articles in my training data, and this index corresponds to that. The number on the right-hand side is the column. So this is a matrix, a sparse matrix, with 66 rows and thousands of columns, each column corresponding to a single word, which can also be considered a feature. For example, one of these entries is telling you that the feature represented by that particular column has a count of 2 in the first document. That's what this representation is conveying. It may be a bit difficult to visualize, so let me print out what these numbers correspond to. This is all the vocabulary your CountVectorizer has been able to tokenize: it has taken all the news articles, tokenized all of them, and ended up with this many individual unique words after removing the stop words. And those numbers you saw map to words: you can see 2654 corresponds to "liberty", so the word "liberty" corresponds to column 2654. If you then see an entry like (0, 2654) with a value of 2, that means that in the first document the word "liberty" occurs twice. That's exactly what it means. So we've been able to convert all the news articles into matrix form; that's like 80% of the work done now.
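Here is a minimal sketch of the training-script steps described so far (the file name, column names, and variable names are assumptions, not necessarily the exact ones used in the course code):

import pickle
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import GaussianNB

# labeled training data produced by the scraper and labeled by hand
data = pd.read_csv("crude_oil_news_articles.csv", encoding="ISO-8859-1")
X = data["News"]                        # article bodies
y_train = data["Sentiment"]             # hand-made labels (e.g. positive / neutral / negative)

vectorizer = CountVectorizer(stop_words="english")
X_vect = vectorizer.fit_transform(X)    # sparse 66 x N matrix of token counts
print(vectorizer.vocabulary_)           # word -> column-index mapping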
So you've been able to do such a difficult task with just a couple of lines of code. Next, I'm using the pickle library, specifically the dump method, to dump this vectorizer we've created: I'm dumping this object into a file called vectorizer_crude_oil. Pickle is a very interesting library: it lets you save pretty much any Python object to disk; it serializes it. Just think of pickle as giving you the ability to take any Python object and save it for later use. If I run this particular command, it takes the vectorizer object I've created and writes it to my drive as a file. Once I've saved the vectorizer as a file on my hard disk, with the file name vectorizer_crude_oil, I can later import that file back any time I'm running the code, and it will be the same object as the one here. That's why pickle is used so much: it lets you save any Python object locally and then reuse it. So that's what's happening; I've already done it, and I already have the file saved in the folder you see here, so I just import it when I'm trying to predict: I'll use this vectorizer when I'm trying to predict the sentiment of a new news article. One more thing I want to point out: fit_transform is a very important method, and it's doing most of your heavy lifting. It is tokenizing, it is removing the stop words, and it is creating your feature matrix. Now, the sparse matrix I showed you is not the very intuitive row-and-column, machine-learning-style representation we talked about in the previous lecture videos, but you can get that using the todense() method. If you run it, you'll actually see the rows and columns we're used to seeing. So this is the matrix we've created; these are all the independent variables, this is the training data, and then there's a dependent variable that corresponds to each of these rows. Once we have that, we can simply take the two and train them using any model we see fit. But before we do that, if you recall, we have to apply TF-IDF, because there may be a lot of high-frequency words that don't have any import on the sentiment. That's also pretty straightforward: all you have to do is instantiate a TfidfTransformer object, then again use fit_transform and pass in the count matrix we created, and it will give you the TF-IDF-transformed matrix. You can see that the count of 2 has been changed to roughly 0.31; the same calculation we discussed is applied here.
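Continuing the sketch (the file name is again an assumption):

# persist the fitted vectorizer so the prediction script can reuse it
with open("vectorizer_crude_oil", "wb") as f:
    pickle.dump(vectorizer, f)

print(X_vect.todense())                  # the familiar row-and-column feature matrix

tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_vect)    # down-weights the high-frequency terms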
One thing you may want to look up if you're interested: in scikit-learn, the TfidfTransformer also applies something called L2 normalization. After calculating TF times IDF, it applies L2 normalization to each row; that's just one additional step it performs. Again, look up exactly what it does if you're curious; that's for your own interest. The key thing is that it penalizes high-frequency words and re-weights all the tokens, all the features, that are highly frequent. Once this matrix is there, these are all our independent variables, so this is the training set: the training set for the independent variables is the entire TF-IDF matrix. The training set for the dependent variable is the actual labeling that we did, positive, negative, and so on; that variable, y_train, is a series of the labels we assigned to our data. Now we have the independent-variable set and the dependent-variable series. All you have to do is instantiate an object of the GaussianNB class, call the fit method, pass in the training data, the independent variables and the dependent variable, and that creates the classifier. So all you have to do is run this; it creates the classifier, and this classifier is again ready for you to dump to your local disk and reuse whenever you want. I'm saving this classifier as well, because it is a trained classifier and I may want to use it anywhere, so I'm writing it to my local drive too. So from this script, all I'm doing is creating the feature matrix, saving the vectorizer to my local disk, and saving the trained classifier to my local disk. Then, in the next video, when I'm predicting the sentiment of a news article, all I'll do is read in the vectorizer and read in the trained classifier, pass in a new news article, and have it spit out the sentiment based on the training it got from my clean, labeled data. That's all there is to the naive Bayes classifier. In the next video I'll show you how the prediction happens. 13. Building a ML Based Sentiment Analyzer - Part III: Hello, friends. In this last step we'll import our trained model and vectorizer, pass in new news articles, and see how the prediction performs. We'll be importing pickle again, because we need it to read the pickle files; that's a no-brainer. And then we'll again be using the TfidfTransformer to transform the feature set we get from the new news articles. In these two lines I'm importing my classifier and importing my vectorizer; it's as simple as that. You have to use pickle.load: we used the pickle dump method when we wanted to dump an object as a pickle file, and we use the load method to load it back into Python. That's what we've done here.
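A sketch of the training, saving, and re-loading steps just described. Note that GaussianNB expects a dense array, hence the .toarray() call; the file names are assumptions:

# train the classifier on the TF-IDF matrix and the hand-made labels
clf = GaussianNB()
clf.fit(X_tfidf.toarray(), y_train)      # GaussianNB needs a dense array

# persist the trained classifier alongside the vectorizer
with open("classifier_crude_oil", "wb") as f:
    pickle.dump(clf, f)

# later, in the prediction script, load both objects back
with open("classifier_crude_oil", "rb") as f:
    clf = pickle.load(f)
with open("vectorizer_crude_oil", "rb") as f:
    vectorizer = pickle.load(f)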
So we have brought the classifier and the vectorizer back into our system. Now I'm going to import the new news articles: I used the web-scraping script again, got some new news articles, and saved that file with a "_test" suffix. These are the 21 new news articles I extracted, and I'll be applying the models to them. Let me read this in: the X_test variable holds all 21 news article bodies, the articles I've just extracted, saved in the X_test variable. Now all I have to do is pass this X_test variable to the vectorizer I just imported. If you recall, this vectorizer already holds the rows and columns of the features it picked up from our training data. We already have the feature set, so we are not using the fit_transform method; we're using the transform method, and that is very important (I've also mentioned it in the code comments), so keep it in mind. When you already have a vectorizer trained earlier on your training data, and you get new data whose sentiment you're trying to predict, you don't apply fit_transform again. If you applied fit_transform and there was a new word, a new feature, in the new news articles that was not in the training data, it would change the feature set and break things. All you're trying to do is pick out the features, the words, that are already part of your trained vectorizer. That's a bit of a trick you have to understand. For example, say you get a word like "hypnotize" that did not occur in the training set; my existing vectorizer does not have that feature. If you applied fit_transform, it would create a new column, a new feature, for that new word, but then the prediction would fail, because the classifier was not trained on that feature. As a result, you have to use transform here: if transform encounters any new features, it simply discards them. That's why transform is used. Then we again use todense() to show it in the familiar way: we'll have 21 rows, and then however many features were picked up from the training data as the number of columns. So we have this new matrix, and all we need to do now, after transforming it with TF-IDF of course, is pass this matrix to the classifier we already loaded and predict. I'm saving the output in y_predict; this is my final predicted data, and you can see the predicted labels, positive, negative, and so on. I double-checked the performance, and it's pretty good: accuracy is in the ballpark of 80%, which is pretty strong. I'd actually encourage you to build your own classifier for whatever sector you're interested in.
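To recap the prediction step in code (the test-file name is an assumption; also note this sketch fits a fresh TfidfTransformer on the test counts, which is what the lecture describes, though fitting it once on the training data and reusing it is the more conventional choice):

# new, unlabeled articles scraped the same way
test_data = pd.read_csv("crude_oil_news_articles_test.csv", encoding="ISO-8859-1")
X_test = test_data["News"]

# transform (NOT fit_transform): words unseen in training are simply discarded
X_test_vect = vectorizer.transform(X_test)
X_test_tfidf = TfidfTransformer().fit_transform(X_test_vect)

y_predict = clf.predict(X_test_tfidf.toarray())
print(y_predict)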
See how the performance is and share it in the feedback section, in the Q&A section. If your accuracy is not good, then I think there's something wrong with the model, and I can help you improve your accuracy. So the accuracy here is pretty good; that's what I observed. If you have to use a sentiment analyzer, please use a machine-learning-based analyzer, because the accuracy is much better than a lexicon-based analyzer. And as I've already emphasized, it's extremely important to spend a lot of time on labeling the data; labeling the data is the single most important thing if you're trying to build a machine-learning-based classifier. So that's how the machine-learning-based sentiment analyzer is built. I hope you liked this section of the course, and I hope you'll be able to build a sentiment analyzer of your own and share your feedback with me. In the next video we'll wrap up this particular section. Thanks a lot. 14. Sentiment Analysis Application - Opportunities & Challenges: Hello, friends. So what is the application of the sentiment analyzer you just built? It's a new and evolving area of research in the field of trading, so let me qualify that at the outset: I don't know of a lot of firms or individuals who are using sentiment analysis for trading on a consistent basis. You may build a script and run it in the morning, just before trading starts, to get a general feel for how the news articles came in overnight, what the sentiment was, and whether any important news has come out; that's something you can do before the trading session starts. But I have not seen, I don't know of, a lot of people who have incorporated sentiment analysis into real-time trading. That's one of the things I want to try if I find the time. There are some commercial products out there, finished products that sell these sentiment analyzers, and I know a lot of them have bombed already because the accuracy was so poor; most of them were based on a lexicon approach, and as we've discussed, a lexicon-based approach is not recommended at all for financial trading. That's the problem a lot of early products in this space faced. But there is a huge unmet demand for a good, sound ML classifier: if you're able to build a classifier based on a lot of good data, then forget about trading, you can actually sell it and make money in this industry, because there is massive unmet demand for a sound ML classifier for sentiment analysis. That's something you can look at if you have the time and want to build something; there is a market for it. But there are some obvious challenges in building something like that. The first is labeling a large volume of news data, and labeling it accurately. You need to be an expert in the industry to do a good job of labeling, and then you have to be very patient, going through reams of news articles going back decades and labeling them. That's an obvious challenge. The second problem is that in equity trading, which is the most popular form of trading, you cannot have a general classifier.
You cannot train on data about Amazon and use that model on a company like General Motors; that's just not sensible, because they are two very different companies in two very different industries, and news that is pertinent for Amazon may not be pertinent for a company in the utility industry or some other industry. So you at least need an industry-specific classifier, and that means a lot of additional effort. The third challenge is that the streaming news services you would want for intraday trading, for example Thomson Reuters or Bloomberg, are quite expensive for retail traders. I don't know of many people who have bought the streaming news feeds from Thomson Reuters or Bloomberg. If you did have that kind of streaming news service, then, as you can see, it doesn't take much time to spit out the sentiment once the model is ready, so you could actually get the sentiment on the fly. That's something you could do, but again, it's cost-prohibitive; it's a bit expensive. As I showed you in this section, it's much easier to build a sentiment analyzer for commodity trading, for example natural gas or crude oil, because it's a single product, and there are a lot of free news websites where you can see an aggregation of these news articles. So before you start trading you can pull all that data, scrape it, pass it to an analyzer, and get the sentiment. Commodity trading is still much easier to cover than building something for equities. So this is where I'll stop. I hope you liked this section; it was quite a lot of material that I've tried to cover over the past lecture videos. Do leave your feedback, and share whether you're actually using sentiment analysis in your trading; that would be an excellent insight for me. I hope you liked the section, and I'm looking forward to your feedback. Thanks a lot.