Natural Language Processing (NLP) with Python and NLTK | Abhishek Kumar | Skillshare


Natural Language Processing (NLP) with Python and NLTK

teacher avatar Abhishek Kumar, Computer Scientist at Adobe

Watch this class and thousands more

Get unlimited access to every class
Taught by industry leaders & working professionals
Topics include illustration, design, photography, and more


Lessons in This Class

26 Lessons (3h 33m)
    • 1. Introduction to NLP (Natural Language Processing)

    • 2. NLTK: Introduction

    • 3. Structured vs Unstructured Data

    • 4. Reading Text data

    • 5. Exploring the Data: Data Exploration

    • 6. NLP Pipeline for Text data

    • 7. Removing Punctuation | Pre-processing | Cleaning

    • 8. Tokenization | Pre-processing | Cleaning

    • 9. Removing Stop Words | Pre-processing | Cleaning

    • 10. Stemming

    • 11. Porter Stemmer in Python

    • 12. Lemmatization in Python

    • 13. WordNet Lemmatizer in NLTK Python

    • 14. Vectorization in Python

    • 15. Count Vectorization

    • 16. N-Grams Vectorization NLP

    • 17. TF-IDF Vectorization

    • 18. Feature Engineering: Introduction

    • 19. Feature Engineering: Feature Creation

    • 20. Feature Engineering: Feature Evaluation

    • 21. Feature Engineering: Transformations

    • 22. Evaluation Metrics: Accuracy, Precision and Recall

    • 23. K-Fold Cross-Validation

    • 24. Random Forest - Introduction

    • 25. Random Forest - Building a basic model

    • 26. Random Forest with holdout test






About This Class

Natural Language Processing (NLP) is a very popular field with many applications in our daily lives. From typing a message to the automatic classification of mail as spam or not spam, NLP is everywhere.

NLP is a field concerned with the ability of a computer to understand, analyze, manipulate, and potentially generate human language. In this course we study NLP and use the Natural Language Toolkit (NLTK) in Python.

The course covers the following:

  • Introduction to NLP and NLTK
  • NLP Pipeline
  • Reading raw data
  • Cleaning and Pre-processing
  • Tokenization
  • Vectorization
  • Feature Engineering
  • Training an ML algorithm to classify spam and non-spam messages

This course will be very useful for applied machine learning scientists and data scientists who work on NLP/NLU.

Meet Your Teacher


Abhishek Kumar

Computer Scientist at Adobe






1. Introduction to NLP (Natural Language Processing): Hello, everyone. Welcome to this course on NLP with Python. NLP means natural language processing. But why is it so important, and why do you hear the term everywhere these days? To understand its importance, let's look at some examples. Have you ever wondered how, when you are typing a message in WhatsApp, it is able to suggest meaningful words even before you finish typing the word? Or have you looked at the spam or junk folder of your email? Why do so many mails land in that junk folder before you even read them? If you check, you will see that most of those mails really are junk. All of these are applications of NLP, or natural language processing. In this course we will study NLP, and we will also use the NLTK toolkit with Python. First, let's understand the definition of natural language processing. NLP is a field concerned with the ability of a computer to understand, analyze, manipulate, and potentially generate human language. The computer should be able to understand human language, and human language can be any language we use for communication: it can be English, it can be French, or any other language. It should potentially even be able to generate some language, producing text that looks as if it were written or spoken by a human being. Now let's look at the real-life examples I mentioned earlier. We have email spam filters: based on the subject or the text inside those emails, the email client is able to classify each mail as spam or not spam, and accordingly, if it's spam, it will be moved to the spam folder, while the rest stay in the inbox, or wherever your rules send them. NLP is also used in autocomplete when you do any search on Google.
When you start typing something like "natural", as soon as you type it, Google will suggest many different completions, like "natural language processing" or "natural language processing in Python", and so on; you can try this on your own. It is able to suggest many meaningful completions, and you can select the one you want, so you don't have to type everything. Similarly, even in email, when you are writing a message, it will try to complete the sentence on its own. Then we have spelling and grammar checkers: if you have misspelled a word or made a grammatical error, they should be able to point out that there is a mistake. So what are the different areas of NLP? There is sentiment analysis, topic modeling, text classification, parts-of-speech tagging, sentence segmentation, and more. But within all these areas, one thing is core to NLP, and that is extracting information from a block of text. Given a block of text, extracting the meaningful information from it is what drives the computer's understanding of the language; the relevance of that information will vary, since the different tasks are very task-specific. So this was just a brief motivation for starting with NLP. In the next video we will start with the NLTK toolkit in Python, so stay tuned for that. 2. NLTK: Introduction: Now that we have some basic understanding of NLP, or natural language processing, let's see what the NLP toolkit, NLTK, is. NLTK is a suite of open-source tools created to make NLP tasks in Python easier. As we saw in the last video, NLP has revolutionized many areas: parts-of-speech tagging, sentence translation, even text generation, and many more applications. There are many built-in functions and libraries included inside the NLTK library, so you don't need to implement everything from scratch.
For example, remember the stem function inside it. If you have many related words, like "coding", "coded", and so on, it stems them all to the root word, like "code". You could write a complex library to do all of this yourself, but it is already present in NLTK, so you don't need to do anything. Similarly for tokenization: given a sentence or a text, you may want to tokenize it into a list of words, and you don't need to write a custom function yourself; you can just use the tokenize function there. And similarly there are many more things, like stop-word lists and other utilities, which are built into NLTK, and many more are being added continuously due to its open-source nature. So it's a very useful library, and if you don't know how to use it, then anything you want to design or develop will be very slow, because you will need to do everything yourself. So let's start with the setup of NLTK. First you need to install NLTK. For more details you can visit the official NLTK website and see the installation instructions for Linux, Mac, or Windows, as appropriate for your system. Once you have done that, you can import it using `import nltk`, and if you don't get any error after installing and importing, then you are fine to go. Once you have installed NLTK and imported it in your Python code, you can run `nltk.download()` and explore the packages. When you run `nltk.download()`, a UI will come up and it will give you a list of packages. If you are doing it in a Jupyter or Colab notebook, it will again print this kind of list; you can list the packages using `l`, and for whatever package you want to download, you can type `d` and then the package identifier, and it will download it.
If you are using Google Colab, you can type `all` and it will download all the packages. But if you are running it from your own command line or an IPython/Jupyter notebook, it will show the downloader UI, and from there you can select a package and click Download, and it will finish after some time. Once you are done downloading, you will want to see what functions are actually present in this package. You can run `dir(nltk)`: you need to import it first, then download, and then it will print a list of its contents, like `stem` (as I explained, this is called stemming because it stems words down to their root words), then `tokenize`, and then `pos_tag`, which stands for parts-of-speech tagging. There are dozens of these functions, and I cannot go through all of them in this video, but you can explore as much as you want; many of them will be useful for the application you are trying to build. Now let's see one small example of the tokenize function, just to give you a flavor of how to use the various components of NLTK. First you import nltk, as we have done already, and then from `nltk.tokenize` you import `word_tokenize`. This will tokenize a sentence into a list of words. For example, if your input text is "I am learning NLP and using NLTK" and you call `word_tokenize` on this input text, it will return `word_tokens`, a list which contains all the words: "I", "am", and so on through "NLTK". It has tokenized the words and returned them as a list. So when you print the input text, it prints the sentence, and when you print the word tokens, it prints the list. Let's see all of this in a notebook. First I will do `import nltk`. I have already installed NLTK on my system, so if you haven't installed it, you will get an error when you do `import nltk`.
So first you need to install NLTK, and then you can import the library. When I go ahead and run it, I don't get any error; that means it's installed. Now we would like to download the packages and explore what's inside. I will run `nltk.download()`, and this is the UI I was talking about: you get the whole list of packages. Let's select all packages and then click Download. It is now downloading all the packages, so we will pause the recording and come back when it's done. Now we see that all the packages have been downloaded, so we can quit this window, and we get `True` here. Now let's explore it: we will run `dir(nltk)`, and this will list all the various functions we have been talking about. You can see things like `pos_tag` for parts-of-speech tagging; similarly we have the `stem` function for stemming, and we have `tokenize`, another ready-made function, available here. Now let's write our tokenize example. From `nltk.tokenize` we import `word_tokenize`, and then we define the input text, which is "I am learning NLP and using NLTK". So this is our input text; let's tokenize it and save the result in `word_tokens`. We will use the `word_tokenize` function on the input text, and run it. We get an error: "input_txt is not defined"; there was a spelling mistake, so let's fix it, run again, and print the input text and then the word tokens. You can see that this is the input sentence and that `word_tokens` contains the list of words. So in just two lines of code, using NLTK's ready-made `word_tokenize` function, we have tokenized this entire sentence. That's why you can see how powerful NLTK is, and you can see here the extensive list of all the functions available. So that was just a brief introduction to NLTK.
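NLTK's `word_tokenize` depends on tokenizer models fetched by `nltk.download()`. If you just want the flavor of tokenization without any downloads, a regex split is a rough, download-free stand-in (a sketch only; the real `word_tokenize` also splits contractions and keeps punctuation as separate tokens):

```python
import re

def simple_word_tokenize(text):
    # Pull out runs of word characters, dropping spaces and punctuation.
    # A crude stand-in for nltk.tokenize.word_tokenize.
    return re.findall(r"\w+", text)

input_txt = "I am learning NLP and using NLTK"
word_tokens = simple_word_tokenize(input_txt)
print(word_tokens)  # ['I', 'am', 'learning', 'NLP', 'and', 'using', 'NLTK']
```

For this simple sentence the output matches what the lesson's `word_tokenize` call produces.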
In further videos we will see more of NLTK. Thank you. 3. Structured vs Unstructured Data: Before moving further in the course, it's very important to understand the difference between structured and unstructured data, because we will be working with lots of unstructured data throughout this course. Structured data, as the name suggests, is well organized; it is well formatted, easy to search, and typically stored in relational databases. Some examples are names, dates, stock or purchase histories, and so on. For example, suppose you have a dataset comprised of the purchase history on some site. It may be organized into columns like customer name, date of purchase, product name, quantity, and price. Here we can have user1 (or any other name), the date of purchase, say they received two pieces at 100 rupees, and similarly for user2. We can see that there is a clear structure in this data, and it's well formatted. If we want to search for, say, which products user1 bought in 2010, we can query the data and find the rows belonging to user1 and all the data associated with that user. We can easily query whatever data we want. So this kind of data is very structured and easy to work with. But most of the time we will be working with unstructured data in this course, and it does not have any predefined format. That's why it is difficult to process, collect, and analyze, and pre-processing becomes very important here. You will take some unstructured data and then try to find some structure in it: do some pre-processing, try to convert it into structured data, and then work on that structured data. Some examples are tweets, which can be very random; you cannot find any regular pattern in the way users write them.
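The purchase-history query described above can be sketched with pandas. The table and its column names here are illustrative stand-ins, not data from the course:

```python
import pandas as pd

# Hypothetical structured purchase-history table.
purchases = pd.DataFrame({
    "customer": ["user1", "user1", "user2"],
    "product":  ["pen", "notebook", "pen"],
    "quantity": [2, 1, 5],
    "price":    [100, 50, 100],
})

# Because the data is structured, querying it is a one-liner:
user1_rows = purchases[purchases["customer"] == "user1"]
print(user1_rows["product"].tolist())  # ['pen', 'notebook']
```

This ease of filtering is exactly what unstructured text lacks until we impose some structure on it.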
Similarly, most text data, video data, audio data, and other social media activity, like Facebook posts where people post text, images, and anything else, can be very random; they can also share someone else's post. The images captured by surveillance cameras are also unstructured data. We need to find patterns in them, try to extract some structure, and then we can go ahead and apply our NLP applications. So that was a brief introduction to the difference between structured and unstructured data. In the next video we will read some text data, which will be in a somewhat unstructured form, and we will try to find some structure in it, so stay tuned for that. Thank you. 4. Reading Text data: Most text data is unstructured. In this video we will see how to read such text data, which is not structured. For that we will use a popular dataset, the UCI SMS Spam Collection dataset; I have given the link to the website here, and you can even Google it, like "UCI SMS spam collection dataset", and you will find the site in the first result itself. Here you can go to the data folder and download the SMS collection zip. I have already downloaded it, and this is the file you will get inside that zip folder. You can see that it's a collection of just two columns: the first value is either "ham" or "spam", and then there is the SMS message. This dataset is mainly used for classification tasks, like deciding whether a given SMS is spam or not, and we will use it in multiple places in this course. Once you have downloaded the dataset, we can go ahead and start the actual coding. Here we will use two types of methods for reading the dataset; the first will be using the open function.
We will just read the raw characters and then manually clean the data. You can see that it's just two columns separated by a tab, so it's somewhat semi-structured; it's not completely unstructured, because we see some strict pattern: each message has one "spam" or "ham" label, then a tab separator, and then the actual SMS message. So we will manually read the characters, and then, using the delimiters, we will add the data to our pandas DataFrame. The second will be the easier way, where we will directly use pandas' `read_csv` function; it easily accepts a separator and works well with tab-separated data. So let's begin with the first method, using open. We write `rawData = open(...)` and pass in the file name, "SMSSpamCollection", then call `.read()`. Then let's print the first 500 characters that we just read, and run it. Here you can see that after "ham" there is a tab, then the message text; when the SMS message ends, we have a newline character, then again "ham", then again a tab and a new SMS, and so on. So it has read all the characters in this raw data. Now we will parse it, because we see that it has tabs and newline characters, and we will store the parsed result in `parsedData`. First, let's use a common delimiter: we will replace all the tabs with newline characters so that there is just one delimiter, and then split on newline. Now let's print the first ten elements of the parsed data. We see that the first element is "ham", because we replaced the tabs with newline characters too, so every delimiter causes a split into a new element in this list: first "ham", then the message, then again "ham", then again a message, then "spam", then a message.
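The parsing approach in this lesson (tab to newline, split, then pull out alternating elements) can be sketched on a small in-memory sample. The three rows below are illustrative stand-ins for the real file, which ends with a trailing newline just as this string does:

```python
# In-memory stand-in for the tab-separated SMSSpamCollection file.
raw_data = "ham\tGo until jurong point\nspam\tFree entry in 2 a wkly comp\nham\tOk lar\n"

# One delimiter everywhere, then split on it.
parsed_data = raw_data.replace("\t", "\n").split("\n")

label_list = parsed_data[0::2]  # elements 0, 2, 4, ... are labels
msg_list   = parsed_data[1::2]  # elements 1, 3, 5, ... are messages

print(label_list)  # ['ham', 'spam', 'ham', ''] -- note the stray empty string
print(msg_list)    # ['Go until jurong point', 'Free entry in 2 a wkly comp', 'Ok lar']
```

The trailing empty string in `label_list` comes from the file's final newline; the lesson hits exactly this off-by-one and drops the last element before building the DataFrame.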
So, alternately, we have either "ham" or "spam", and in between, at the odd positions, we have the SMS messages. Now we can break this complete list into two lists: one list will contain all the spam/ham words, and the other list will contain just the SMS messages. We will call the first one the label list, which will hold the labels "spam" or "ham"; we will start from index zero and advance in steps of two, so first it takes the element at index 0, a "ham", then skips one, takes the next label, and so on. Similarly, we will have the actual message list; it is built the same way, but it starts from index 1. So now we have separated the data into two lists. Let's print them: let's take the first five elements of both, the message list sliced to five and similarly the label list sliced to five, and run it. We get an error message here: it says that the parsed list is not defined; I had misspelled `parsedData`. Okay, let's run it again. Here the messages should start from index 1, because the messages are at 1, 3, 5, ... and the labels are at 0, 2, 4, ...; I was actually printing the same thing twice. Now you can see that the first list has the first five labels and the second shows the first five messages. Now that we have our information separated into two lists, we can combine them into a DataFrame that we can use for analysis. So let's import pandas as pd, and now we will combine both these lists. For the combined DataFrame we will use `pd.DataFrame`, and this expects a dict-like input where we specify both the columns. Here we have just two columns, so we specify the first column as "label" and then the list to be used for that column, for which we will use the label list. So the name of the first column will be "label", and that column will hold the labels.
Then we have the second column, which we will call "sms", and for that we will use the message list. Then we can print `combined_df.head()`, which will show the first five rows. But again we get an error: the arrays must all be of the same length. So let's check: print the length of the label list, then the length of the message list, and comment out the DataFrame line. Here we see that the label list has one row extra; this one element is an extra label. Let's print the end of the list, the last three entries of the label list: we see "spam", "ham", and then an empty string. This empty string is coming from the end of the file, because the rows alternate and the file ends with a trailing newline, so it is wrongly appended there and should not be there. So now let's discard that last element, skipping one entry from the label list, and run it again. Now it prints: we have the first five rows of this DataFrame, and you can see some structure emerging. We have cleaned our data and divided it into two columns: one column is the label and one is the SMS, so the label column holds "ham" or "spam" and the "sms" column holds the corresponding messages. So now we have cleaned the initially semi-structured data. Next, let's see the second method, which is easier. Pandas' `read_csv` function is very good at reading text data and easily accepts whatever separator we give it. We write `dataset = pd.read_csv(...)`; we have already imported pandas as pd, so we can just use it, passing the name of the file, "SMSSpamCollection". We also need to pass the separator, which here is a tab, and then we pass `header=None`. This is required because the actual text file starts directly with a label and SMS message; the first row does not contain the names of the columns.
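A minimal sketch of the `read_csv` approach, using an in-memory sample in place of the real SMSSpamCollection file (the three rows are illustrative):

```python
import io
import pandas as pd

# Stand-in for the tab-separated, header-less SMSSpamCollection file.
sample = "ham\tGo until jurong point\nspam\tFree entry in 2 a wkly comp\nham\tOk lar\n"

# sep="\t" tells pandas the columns are tab-separated;
# header=None tells it the first row is data, not column names.
dataset = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

print(dataset.shape)          # (3, 2)
print(list(dataset.columns))  # [0, 1] -- pandas numbers the columns itself
```

With the real file you would pass the filename instead of the `StringIO` object; everything else stays the same.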
If we read it into a pandas DataFrame using `read_csv` without that option, it will automatically take the first "ham" as the name of the first column and the first message as the name of the second column, which would not be meaningful. When we pass `header=None`, it will not treat that first row as a header. Now let's go ahead and print the first few rows of this dataset. Again we see that we have the same data, but pandas has named the columns 0 and 1. So this was a really easy way of doing the same thing. Now the cleaned data is ready for the next steps in the NLP pipeline, and we will see more of it in future lessons. Thanks for watching. 5. Exploring the Data: Data Exploration: Now that we know how to read raw text data, it's time to explore the dataset. Before diving deep into cleaning the dataset and processing it, it's very useful to first explore it and know more about it. For example, how many rows are there in the dataset? Remember that we are currently working with the SMS Spam Collection dataset, which is a dataset containing labels, ham or spam, and the SMS text messages. So we can explore the total size of the dataset; then how many labels are spam and how many are ham, and what their ratio is in the dataset: is it skewed more towards spam labels or ham labels, or are they balanced? Then we can also explore how many values are missing: maybe some label is present but the SMS is not, or the SMS is there but the label is not present. Such rows will not be useful for us, so we will first get rid of them, as they will not be useful. Let's see all of this one by one. We will open our notebook. Earlier we read the data and printed the first five rows, but we had not added our column names, so let's first read it again.
So we read the new dataset, and to add the column names we just set `dataset.columns` equal to the list of column names: the first will be "label", and the second column is "sms". Let's run it and print the first few rows. Now you see that we have the label and SMS columns, and we have the data. So now it's time to explore it. The first thing is the shape of the data: how many rows and columns are there? Simply, `len(dataset)` will give the number of rows in the dataset, and `dataset.columns` will return the list of columns, whose length we print. So it has 5572 rows and 2 columns. Now that we know about the shape of our dataset, it's time to explore how many labels there are: how many ham versus how many spam. So we will print the length of `dataset[dataset['label'] == 'ham']`: this condition is true for all the rows whose label is "ham", and then we take its length. Let's run it; it prints the number of ham rows, and similarly for spam. We see that most of the rows are ham and comparatively few are spam; the spam count is much lower than the ham count. So we get some idea about our dataset. This is very important information to have before we build our classifier: it's very important to understand what the ratio of spam to ham in your dataset is, because to build a good classifier we will need to train it on a comparable number of labels for both spam and ham. If we just went ahead and used this dataset as-is, it might not produce a good classifier. Now let's look at the last point, which is: are there any missing labels or texts in the data? Let's go ahead and check for missing data, starting with the labels. We will use the `isnull` function.
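The exploration checks from this lesson can be sketched on a toy DataFrame. The four rows below are made up for illustration; the real dataset has 5572:

```python
import pandas as pd

# Tiny in-memory stand-in for the labeled SMS dataset.
dataset = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham"],
    "sms":   ["Ok lar", "Free entry!!", "See you soon", "On my way"],
})

n_rows = len(dataset)               # total number of rows
n_cols = len(dataset.columns)      # number of columns

n_ham  = len(dataset[dataset["label"] == "ham"])
n_spam = len(dataset[dataset["label"] == "spam"])

missing_labels = dataset["label"].isnull().sum()  # count of null labels

print(n_rows, n_cols, n_ham, n_spam, missing_labels)  # 4 2 3 1 0
```

On the real data the same three checks report the shape, the ham/spam balance, and whether any labels or messages are missing.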
When a value is missing, that is, it is null, `isnull` will return 1 for it, and we take the sum of what is returned. This is just for the labels; we do the same thing for the messages as well, and in both cases the sum is zero. So our dataset does not have any missing messages or labels. That's all we will do here; on your own you can explore more, as this is just a basic way of exploring our dataset. In the next video we will see what the typical NLP pipeline looks like. Thanks for watching. 6. NLP Pipeline for Text data: In this lesson we will see the typical NLP pipeline for processing text data. This is the overall pipeline for working with text data, so let's understand it one step at a time. First we have the raw data that is present in the dataset; while working with the SMS Spam Collection dataset, this is the raw text data that we have. At this step, Python does not understand any distinction between words: what it sees is just a stream of characters, and to it all the characters are the same. This form is not very useful for working with, so first we need to convert it to individual words. This is called tokenization. For example, suppose I have the input sentence "I am teaching NLP in Python". All the characters mean the same to Python, even a space, a full stop, or a double space; all these characters are alike to it. When we tokenize, we convert it to a list consisting of "I", "am", "teaching", "NLP", "in", "Python". This list of words is what tokenization produces, and it tells the model what to look at: it should not look at characters, because the characters have been grouped into words, so this is one word and this is another word. But still, some words may be more important than other words, and Python doesn't know which word is more important.
It just sees a list of words, and to it all the words are similar, just as in the previous step all the characters were similar. So we need more processing on it; this is called cleaning, and we will clean the text. So the next step is text cleaning, and what it does is remove the stop words. Stop words are the words which just connect the other words to form a meaningful sentence. For example, here we can remove "I", "am", and "in", and we will be left with "teaching", "NLP", and "Python". We have reduced the sentence to these tokens because they are the more important words and convey the context of the text represented by this sentence. Words like "I" and "am" can be present much more frequently in many sentences; they are just connectors and do not add much to the context that the machine wants to understand. So we get rid of the stop words and punctuation, and then do further cleaning. Having removed stop words, we can also get rid of suffixes like "-ing" and other endings; this is called stemming. What it means is that words like "teaching" and "teaches" will all be converted to their root word, which is "teach", alongside "NLP" and "Python". This is the result of the cleaning step; we can do other things in cleaning as well. Once we have cleaned the data, it is still not ready to be fed to the model, because the model, that is, any machine learning algorithm, does not understand text data; it works on numeric data. So we need to convert this text to numbers. This is called vectorization. Vectorization can involve many techniques; some of the popular ones are count vectorization (bag of words) and TF-IDF, which stands for term frequency-inverse document frequency.
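The cleaning step just described can be sketched with NLTK's `PorterStemmer`, which works without any separate download. The tiny hand-picked stop-word set below is an illustrative stand-in for the full list in `nltk.corpus.stopwords` (which needs a one-time download):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Illustrative stop-word set; the real list comes from nltk.corpus.stopwords.
stop_words = {"i", "am", "in", "the", "is"}

tokens = ["I", "am", "teaching", "NLP", "in", "Python"]

cleaned = [t for t in tokens if t.lower() not in stop_words]
print(cleaned)  # ['teaching', 'NLP', 'Python']

stemmed = [ps.stem(t) for t in cleaned]
print(stemmed)  # 'teaching' becomes 'teach'; note stem() also lowercases
```

Stemming is covered in detail in lessons 10 and 11; this is just the pipeline step in miniature.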
For example, in order to convert the words to numbers, you can iterate through your entire text data after it has been cleaned. Obviously it will no longer contain the stop words or punctuation, and we will also have stemmed it, so it will just be a list of words: word1, word2, word3. If some word occurs again, say in the next sentence, we will not include it again; we can create a list of unique words, irrespective of their order, and then assign an index to each, like 0, 1, 2. So if this entire vocabulary has three words, it will have indices 0, 1, and 2. Then, to represent any sentence, say a sentence whose clean version is just "teach" and word3, its representation will be 1, 0, 1: the first 1 means "teach" is present, the 0 means word2 is not present, and the last 1 means word3 is present. This is just one possible representation; there can be others, like a matrix where each row corresponds to a given sentence and each column corresponds to a word. We will see more of this in the later lessons on vectorization. Once we have converted our text data to a numeric representation, we are ready to feed it to our ML algorithms. In this step we will feed in the vectorized text and also the corresponding label, like spam or ham: we will feed in that this numeric vector represents spam, that this vector representing another sentence represents ham, and so on, for many examples. The labels can be 0, 1, 2, ... depending on how many classes we are working with: this set of numbers means class 0, that set of numbers means class 1. So we are ready to feed lots of such cleaned and vectorized data to our algorithm, to the model.
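The vocabulary-and-presence-vector idea above can be sketched in a few lines of plain Python. This is a toy sketch of the concept, not NLTK's or scikit-learn's actual vectorizers:

```python
def build_vocab(documents):
    """Assign an index to every unique word across all cleaned documents."""
    vocab = {}
    for doc in documents:
        for word in doc:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def one_hot(doc, vocab):
    """Return a 0/1 vector: 1 where the vocabulary word appears in the document."""
    vector = [0] * len(vocab)
    for word in doc:
        vector[vocab[word]] = 1
    return vector

# Two already-cleaned toy "sentences".
docs = [["teach", "nlp", "python"], ["teach", "spam"]]
vocab = build_vocab(docs)  # {'teach': 0, 'nlp': 1, 'python': 2, 'spam': 3}
print(one_hot(["teach", "spam"], vocab))  # [1, 0, 0, 1]
```

The count and TF-IDF vectorizers in lessons 15-17 refine this same idea: same vocabulary indexing, richer values than plain 0/1.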
And when we feed in a sufficient number of such examples, our model gets trained, and then we can test it on new data to see whether the model works correctly or whether it needs more training. Once we are done with the training and our model is ready, we are ready for the interesting part — using it, for instance, as a spam filter: we can deploy it to production, and it can be used for real-life prediction of whether an email is spam or not. So this was the entire pipeline for working with text data. For other types of data it may vary slightly, but the concepts remain the same. I hope you enjoyed the lesson. See you in the next lesson, thank you. 7. Removing Punctuation | Pre-processing | Cleaning: In the last video we saw a typical pipeline for NLP when working with text data — this was the pipeline I explained in the last video. You can see that reading and tokenizing, and then text cleaning, are separate steps before the data can be vectorized; in vectorization we convert the data into some numeric form, because the computer does not understand characters or words, it understands numbers, so it is important to convert to numbers before feeding the data to an ML algorithm. Let's combine those steps into one and call it pre-processing, because we are taking the raw text data, tokenizing it, cleaning it by removing stop words, punctuation and other things, and stemming it, before feeding it to vectorization. So this is called pre-processing. This pre-processing pipeline will consist of several things: first, removal of punctuation — by punctuation we mean special symbols like the full stop, single quotes, double quotes, dollar sign, percent sign and so on — and we need to get rid of those. Then we do tokenization, which converts the raw text into words, and then we remove the stop words — like "am", "is", "the" — words which do not add any significant meaning to the context of the text.
So we get rid of them so that our ML algorithm has fewer words to work on. These stop words just add to the size of the existing text, so if our text has, say, 10,000 words, removing the stop words might bring it below 5,000, and the algorithm then has fewer — but more important — words to work with. Then we do stemming, taking words with different suffixes, like "called", "calling", "calls" and so on — this I have already explained — and mapping them all to a single root word, which is "call". We do this in order to further reduce the number of words the algorithm needs to work with. These are the main steps in a text pre-processing pipeline. In this video we will be concerned with removing punctuation, and in the following videos you will see the other steps. So let's begin by writing code in our notebook. First we will import pandas, and we will read the data — we are still working with the SMSSpamCollection dataset — loading it using the read_csv function, with tab as the separator and no header row. Then we specify the column names, "label" and "message". Now let's print it: it prints the first few rows, but each message is truncated and we cannot see what is going on. If we remove the stop words or remove the punctuation, the change may not be visible, so let's increase the length of the message that is displayed in the DataFrame. By default, when we print the DataFrame, it displays a maximum of fifty characters per column; we can change that using pandas' set_option function — the option name is "display.max_colwidth" — and set it to 200, whereas the default is 50. Let's run it again, and now you can see that we get a longer message, so we will be able to see the difference as we work further. Now let's begin by removing the punctuation, for which we first need to tell Python what punctuation looks like.
Why are we getting rid of punctuation here? Because for this task the punctuation does not carry much meaning — for these messages, punctuation does not add much to the meaning of a sentence. But we do need to tell Python explicitly what these punctuation characters are and that it is fine to get rid of them; until we tell it, Python will not know. For example, compare "I will be teaching NLP." and "I will be teaching NLP not." — if we ask Python whether these are the same, it will say no, but we know they are almost the same, just slightly different. Note that punctuation can sometimes be very important: a question mark may matter as much as the word "not", which completely reverses the meaning, so a machine might think that a dot or question mark carries meaning. But we know that for our data that is not the case, so we tell Python to filter out all punctuation. The string library has a list of punctuation characters: we import string, and then string.punctuation gives us the list of the usual punctuation symbols. Now we can use this to filter our existing text. We will write one function to remove the punctuation. What this function does is iterate over each character of the text: if the character is a punctuation symbol it is discarded, and if it is not punctuation, we keep it. So let's define a function, remove_punct, to which we pass our text. We can use a list comprehension — a Python feature that iterates through all the characters of the text and inserts them into a new list, all in one line. We write [c for c in txt if c not in string.punctuation]: the "for c in txt" part iterates through all the characters of the text, and then we check the condition.
If c is not in string.punctuation, the character is returned into the new list; if it belongs to the punctuation set, nothing is returned for it. So it simply checks whether each character is punctuation: if not, it is kept; if so, it is dropped. The output is a list of characters, which we then return as the cleaned text. Now we will use this remove_punct function, apply it to the second column of our DataFrame ("message"), and store the result in a new column — let's name it "msg_clean". Now let's print the first five rows. We can see that it has removed the punctuation — here there was "now." and the dot has been removed, and this other punctuation in "don't" is gone too. But the result is a list of characters, which does not make much sense, so we will join those characters back together. Notice that wherever there was a space in the original message, the space is preserved in the list, so we can simply join the characters. We use join on the list, and the separator we pass means: between characters, put nothing. If we join with a space instead, a space is inserted between every pair of characters — let's try that and run it — you can see it joins everything with a space between each character, so let's not put anything as the separator. Now it looks right: it has gotten rid of all the punctuation, and it again returns the complete sentence with the individual words. That was all for removing punctuation; in the next video we will start from here and work on tokenization. See you in the next video, thank you. 8. Tokenization | Pre-processing | Cleaning: In the last video we started cleaning our raw text data and removed the punctuation from the text. Now we are ready to tokenize our text.
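A quick runnable recap of the punctuation-removal function from the last lesson (the sample sentence is invented for illustration):

```python
import string

def remove_punct(txt):
    # Keep only the characters that are not punctuation, then re-join them.
    return "".join([c for c in txt if c not in string.punctuation])

print(remove_punct("Hello!!! How are you, friend?"))  # Hello How are you friend
```

Spaces survive because they are not in string.punctuation, so the words stay separated.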
By tokenizing we mean splitting the text into a list of words, or tokens. We will use regex — regular expressions — for that; Python has a built-in library, re, and we will use its re.split function. To re.split we need to pass the pattern on which to split the text. So let's begin tokenizing in our notebook. First we need to import the re library, and then we define our own custom tokenize function, similar to how we defined a custom function for removing punctuation. Let's define it, name it tokenize, and pass in the text that needs to be tokenized. Now we use the split function of re, and we split on all non-word characters: lowercase \w means a word character, capital \W means a non-word character, and + means one or more. We pass in the text, store the result in tokens, and finally we return tokens — the list of words, or tokens. Now we apply this tokenize function to all the elements of the "msg_clean" column, because that is the column that is free of punctuation; the original column still contains it. Let's create a new column, call it "msg_clean_tokenized", and apply the tokenize function to all elements of "msg_clean" using a lambda function. Let's also explicitly tell Python that upper case and lower case do not make much of a difference in this case. For example, "Free" with a capital F and "free" with a small f — the intended meaning does not differ much. The model might ultimately figure that out if given enough data, but rather than making it spend its capacity on figuring this out, we can tell it explicitly. So let's explicitly lowercase the letters before tokenizing, save the result to this new column, and print the first five elements.
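The tokenizer just described can be sketched as follows (the sample sentence is illustrative):

```python
import re

def tokenize(txt):
    # Split on runs of non-word characters (\W+) after lowercasing.
    return re.split(r"\W+", txt.lower())

print(tokenize("I will be teaching NLP"))
# ['i', 'will', 'be', 'teaching', 'nlp']
```

One caveat: if the text begins or ends with a non-word character, re.split leaves empty strings at the edges, which is one more reason to strip punctuation first.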
So now it returns the lists of tokens: you can see that everything is lower case, and these are the lists of words, or tokens. That's all for tokenization; we will continue from here, and in the next video we will remove the stop words, which do not add much meaning to the text. Thanks for watching, see you in the next video. 9. Removing Stop Words | Pre-processing | Cleaning: Now we have gotten rid of the punctuation and have also tokenized our data; it's time to get rid of some redundant words which do not add much meaning. These words are called stop words, and in this video we will see how to get rid of them. For example, there may be lots of short words like "am", "is", "but" and many other such words which we can remove while the meaning of the sentence stays the same. By removing those extra stop words we give far fewer words to our Python algorithm to work with, so it runs much faster. Let's begin by writing code in the notebook. This was the state of the notebook when we tokenized our dataset: the second column contains the cleaned text, free of punctuation, and in the third column we tokenized it into lists of tokens, or words. Here you can see that there are many stop words, like "so", "you", "in", "a", "i", "he" — these words do not add much meaning, so let's get rid of them. First we need to import the NLTK library; NLTK has built-in lists of stop words for various languages, so we need to specify the language for which we want the stop words. We save them to a variable, stopwords. Now let's print the first ten stop words to see what they look like — let's run it — and you can see it has printed the words; these are the stop words, and I have printed just the first ten. Now we will define our own custom function to remove the stop words, and we will pass in the tokenized text.
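That function might look like this — here a tiny hand-picked stop-word list stands in for NLTK's full nltk.corpus.stopwords.words('english'), so the sketch runs without downloading the corpus:

```python
# Stand-in for nltk.corpus.stopwords.words('english') -- a tiny subset.
stop_words = {"i", "you", "a", "in", "so", "he", "am", "is", "the"}

def remove_stopwords(tokenized_txt):
    # Keep only the tokens that are not stop words.
    return [word for word in tokenized_txt if word not in stop_words]

print(remove_stopwords(["i", "am", "learning", "nlp", "in", "python"]))
# ['learning', 'nlp', 'python']
```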
Again we will use Python's list comprehension, which creates a new list from an existing list based on some condition, in just one line. We write txt_clean = [word for word in tokenized_txt if word not in stopwords]: if the word is not in the stop words, it is added to the new list; if it is a stop word, it is dropped. Then we return txt_clean. Now we create a new column — let's call it "msg_nostop" — and apply this remove_stopwords function to all the elements of the clean tokenized column, then print the first five rows. Let's run it. Here you can see that "i" has been removed, "are" has been removed, and so has "in". This is how we get rid of the stop words, and now our algorithm will have far fewer words to work with. That's all for stop words. In the next lessons we will look at stemming, which is another approach for cleaning the data. Thanks for watching, see you in the next video. 10. Stemming: Similar to the stop-word removal we saw in the previous video, stemming is a very powerful tool for reducing the size of the corpus that the model needs to work on. Stemming is the process of reducing inflected or derived words to their root word, or word stem. What this means is that, if you look at this example, all these words belong to the same root word: "called", "caller", "calls", "calling" are all derived from the word "call", so they do not differ much in meaning in the context of the text. It is a good idea to give just the stem word to our model to learn, so that it has far fewer words to focus on — that is why stemming is a very powerful technique. But there are many pitfalls in stemming, and it is not foolproof, because stemmers are based on heuristics and there is no perfect rule that converts every word correctly to its intended stem word.
There are two main types of errors in stemming: the first is over-stemming and the second is under-stemming. In over-stemming, as the name says, too much of the word is cut off; when too much of the word is removed, its meaning may be lost, and words with genuinely different stems get mapped to the same stem. For example, take "university", then "universities", and then "universal" and "universe". The first two should be grouped together and the last two should be grouped together, but the stemmer maps all of them to the same stem — it converts all of them, let's say, to "univers". So first, the meaning is lost; second, all four are converted to one stem word, whereas it would have been better to convert "universal" and "universe" to one stem and "university" and "universities" to some other stem. Since the stemmer converts all four to the same stem, this is an example of over-stemming. Under-stemming is the opposite: two words of the same stem are mapped to different stems. For example, take "data" and "datum": a stemmer may reduce one to "dat" and leave the other as "datum", so they are broken into two different stems, although both should have been mapped to a common stem. This is the case of under-stemming. Now, why is stemming so useful? First, as we discussed, it reduces the corpus of words that the model needs to work on. The second advantage is that we are explicitly telling our model that certain words are correlated: when we reduce word1, word2 and word3 to a common word W using a stemmer, we are explicitly, in a way, telling the model that they are correlated.
And we are providing one word instead of multiple words. If provided a very large number of examples, our model might ultimately be able to figure out on its own that different words are similar, by understanding the context in which they are used — but the model also might not be able to figure it out, so it is better if we can tell it explicitly. Now, what are the different stemming algorithms? Some of these algorithms are the Porter stemmer, the Snowball stemmer, the Lancaster stemmer and the regexp stemmer. There are more stemming algorithms, but these are the ones included in the NLTK package, and the most popular among them is the Porter stemmer. We will see more of it in the next video. Thanks for watching, see you in the next video. 11. Porter Stemmer in Python: In the last video we saw what stemming is; today we will see how to use the stemming algorithms present in the NLTK package — in particular, we will look at the Porter stemmer. Let's begin in our notebook. The first thing we need to do is import NLTK, and then from nltk.stem we import PorterStemmer, and we create an object of it. Let's quickly see what functions are available on this Porter stemmer — here is the list of functions, and we will mainly be interested in the stem function. So let's use it: let's try it on words like "called" and "calling", and let's do the same for "caller", and we also keep the plain "call". Let's run it.
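A sketch of that experiment, assuming NLTK is installed (PorterStemmer is purely algorithmic, so no corpus download is needed); the second snippet reproduces the over-stemming example from the previous lesson:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Inflected forms of "call" collapse to the stem; the agent noun survives.
for word in ["call", "called", "calling", "caller"]:
    print(word, "->", ps.stem(word))
# call -> call, called -> call, calling -> call, caller -> caller

# Classic over-stemming: different senses collapse to one stem.
print([ps.stem(w) for w in ["university", "universe", "universal"]])
# ['univers', 'univers', 'univers']
```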
One is singular, but the steaming is not in jail intelligent enough, and it does not reduced them to a common root instilled there it has kept it separately. Knowledge stray A few more examples. It's a similar to this court quarter, and according we can use other accents also like Ball Bowling's on Wallace. And they're also the same thing. So it somehow distinguishes between accents from the gnomes, the person who performs Texan. But here we saw that it was not able to reduce it. No, we will use stemming on our own. Or is it? It's a missus Spam collection detested. We were doing clean process, so we will continue in. So first we need to import the partners. Then we will import the Ari backers Andan important string. So this is just though or thing. We're not doing anything new here. And let's in the 1st 4 euros off this dinner for him. So no, we will or do the cleaning process so we will define our clean text function. - Then Livilla split it, so every got rid off the punctuation and then we will split it into tokens and we will use the artillery for that and split on. All the non word characters were normal and then pass in the text. When Lee text equal, you look in your list comprehensive, so we will filter or the stop words here if ward north in stop words. So we had saved the stop worst list of Star Wars cuts funding to Inglis in the analytical corpus in the Star Wars. Very well. So the words are didn't stop worse than with retort. And finally we will return the text so deserve clean text. And then we will aired a column for stop words. So this is this will contain the clean text which will you three off punctual ism and stop words and it will be saved in Emma's Gino Stroke Onda. We will play alarmed or function on M u Z column clean text on X nor law and then go ahead and print the 1st 4 euros off this NATO frame. So clean text is not defined clean text. So we're not run this office. We need to run there. 
And now we see that we have tokenized lists of words, free from punctuation and stop words. Next we introduce the new thing, which is stemming the text. We define our own stemming function and pass it the tokenized text; for every word in this list, we pass the word to our Porter stemmer, as we saw in the earlier example. So txt_stemmed = — again we use a list comprehension: word for word in tokenized_text — this is the list we are passing, and it loops over all the words in the list — but instead of returning each word as-is, we apply ps.stem to each of those words. The list comprehension then returns a new list based on this expression, and we return this new list from the stemming function. Let's run it, and now we add a new column to our DataFrame — we will call it "msg_stemmed" — and apply a lambda function, in a similar way, on the "msg_nostop" column, then print the first five rows of the DataFrame. Again we have an error here — "msg_nostop" — there was a spelling mistake there. Okay, now it works. So now we have this new column with the stemmed words. We can see that "available" has become "avail" and "beautiful" has become "beauti", and there will be many more examples. These messages are not in proper English — they are full of common slang — so it does not work perfectly, but you can see the effect in places. That was all about using the Porter stemmer from the NLTK package. In the next video we will see another technique for reducing words to their root words, and that is called lemmatization. See you in the next video, thank you. 12.
Lemmatization in Python: Today we will see another popular normalization algorithm for words, called lemmatization, which is somewhat different from what we saw earlier, which was stemming. Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single root word, or lemma. Unlike stemming, it reduces the inflected words properly, ensuring that the root word belongs to the language: the lemma is the canonical form, the dictionary form, of a set of words, and a lemmatizer always converts inflected words to a form that is actually present in the dictionary of that language. For example, if we have words like "bowl", "bowls" and "bowling", the lemmatizer will convert them to the root word — the canonical word, or lemma — "bowl". So it looks very similar to stemming, but it is slightly more powerful, since it also looks through the vocabulary: it does a full morphological analysis of the words, which is why it is generally somewhat slower than stemming, but more accurate. So what is the difference between lemmatization and stemming? As we saw, it is a speed-versus-accuracy trade-off. Stemming is faster because it simply chops off the end of the word without understanding the proper context in which the word was used — it uses simple heuristics and just truncates. That is why it is faster, but it is prone to errors, as we saw in the last video: there may be over-stemming and under-stemming, and words belonging to different roots may be reduced to the same stem, which should not be the case, and vice versa. Lemmatization, by contrast, is more accurate, because it uses a more informed analysis and always reduces the word to a dictionary word, but it is computationally expensive because, as I said earlier, it performs a full morphological analysis.
So that is the main difference between lemmatization and stemming. In the next video we will see how to use the lemmatizers present in the NLTK package; specifically, we will look into the WordNet lemmatizer, which is the most popular of them. See you in the next video, thank you. 13. WordNet Lemmatizer in NLTK Python: Now that we know about lemmatization, it's time to use the lemmatizers present in the NLTK package. In particular, we will be using the WordNet lemmatizer. WordNet is a collection of words — nouns, adjectives, verbs and adverbs — grouped together into sets of synonyms of those words. In order to use the WordNet lemmatizer in NLTK, we first need to import the NLTK package, then nltk.WordNetLemmatizer; we create an object of it and then use the lemmatize function that is present inside it. We can call it on some words, like wn.lemmatize(...) — the syntax is very similar to the Porter stemmer that we saw in the last video, where we used NLTK's PorterStemmer. In this video we will import both of them and try to compare the difference in results between lemmatizing and stemming. So let's begin in our notebook. First, let's import the NLTK package and then the WordNet lemmatizer from NLTK. Let's also import the Porter stemmer, in order to compare the results of lemmatization against stemming. And let's quickly have a look at the available functions inside this lemmatizer — let's run it — these are the main functions, and we will mainly be concerned with this lemmatize function. So let's compare the results of the stemmer and the lemmatizer on some words. Let's try "goose" and "geese": first, let's pass them to the stemmer and run it.
The stemmer is not able to identify that these two belong to the same word, so it reduces them to "goos" and "gees", which are not even words in the dictionary, so they do not make much sense. Now let's pass the same words to our WordNet lemmatizer and run it. Here we see that it correctly observes the difference between the two and reduces both of them to the common word "goose". Similarly, let's lemmatize "cactus" and "cacti" and run it: it is able to understand that these two belong to the same lemma, and it converts both of them to "cactus". But if we give the same words to our stemmer and run it, it is not able to identify that and just blindly chops the suffixes. So here we see that this lemmatizer is much more powerful than stemming: stemming uses only heuristics and is concerned only with the string it is given — it essentially just chops some suffix off the word — whereas the lemmatizer searches the corpus to find related words and reduces them down to the root word, or lemma. Now let's run this lemmatizer on our SMSSpamCollection dataset. We will first read the text: we import pandas as pd — this is the usual stuff — and then re, and we also import string for handling the punctuation. Then we set the display max_colwidth option to 200, and we save the stop words for the English language. Let's print the first five rows to see that everything is working fine — there is an error, a name is not defined — okay, this should be string — yes, now it is fine. This is just the old routine: we read our data from SMSSpamCollection, separated by tabs, and we name the two columns in the DataFrame "label" and "message"; the label contains ham or spam, and the message contains the actual message. Now we will clean the text.
Now we define our own custom function to clean the text: first we get rid of the punctuation, then we split the text into tokens, using re.split on non-word characters, and finally we remove the stop words and return the text. Then we use this clean function to create a new column holding the lists of words without the stop words and punctuation — let's name it "msg_nostop" — and apply a lambda function on the "message" column; let's print the first five rows to see if it is working correctly. We see that it is: some of the stop words have been removed — here "on" was removed, "who" was removed, and similarly "in" was removed — so many of the stop words are gone. Now we are ready to apply the lemmatizer to the SMSSpamCollection data. Let's define a lemmatize function; it will take the list of tokenized words. Again we use the same pattern, but instead of returning each word as-is, we apply the WordNet lemmatizer to it: we write wn.lemmatize(word) for word in tokenized_text, and then we return this text. Now let's create a new column, call it "msg_lemmatized", apply a lambda function over that last column, lemmatizing each word, and print the first five rows to see the results. We are not seeing much of a change: here we see that "goes" has been converted to "go" and "lives" to "life", but these are not very telling words, so we do not see a big effect — you can see it only on some words. Where we see no effect at all, it is often because the words are not proper English words.
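The goose/geese behaviour seen earlier can be reproduced without downloading the WordNet corpus by standing in a tiny hand-made lemma table for the real nltk.stem.WordNetLemmatizer — the table and the toy stemmer are illustrative simplifications only:

```python
# Toy lemmatizer: dictionary lookup, the way WordNet resolves inflected forms.
LEMMAS = {"geese": "goose", "goose": "goose", "cacti": "cactus", "cactus": "cactus"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)  # unknown words pass through unchanged

def toy_stem(word):
    # Blind suffix chopping, the way a heuristic stemmer behaves.
    return word[:-1] if word.endswith("e") else word

for w in ["goose", "geese"]:
    print(w, "stem:", toy_stem(w), "lemma:", toy_lemmatize(w))
# goose stem: goos lemma: goose
# geese stem: gees lemma: goose
```

The contrast is the point: the lookup-based lemmatizer maps both inflections to one dictionary word, while the suffix-chopper produces non-words.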
So that is how we lemmatize our text. See you in the next video, where we will move to the next stage of the NLP pipeline and vectorize our texts into numbers which can be consumed by our machine-learning algorithms. See you in the next video, thank you. 14. Vectorization in Python: Today we will study vectorization. In case you don't remember, let's see the NLP pipeline diagram once again. These are the main stages of an NLP pipeline when processing text data. We are done with reading the text data — we read the raw text into a DataFrame using pandas' read_csv function — and then we tokenized it, and we also did the data cleaning, where we removed the punctuation and, after tokenizing, removed the stop words, and then used lemmatization and stemming on those tokenized words. Now, after we have a clean list of tokenized words, we are ready for this stage, which is vectorization. Here we have lists of words — word1, word2 and so on — but no algorithm in Python will understand these characters; it needs numbers in order to work on them and train on them. Vectorization is simply the process of taking this list of tokens — clean words — and converting it into some numerical representation, or feature vectors. We will see that we can represent each of these words as some feature vector. So, once again, vectorization is the process of encoding text as integers to create feature vectors, and if you don't know what feature vectors are, they are just vectors of numerical features. Suppose we want to represent an object, for example a word — let's say it is "cat" — and we know that there are only five words in the dictionary, with "cat" coming at the fourth position. Then one possible representation would be: put a 1 at the fourth place and 0s everywhere else.
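The five-word example can be sketched directly ("cat" and the five-word dictionary are the lesson's own example; the other vocabulary words and the helper name are mine):

```python
def one_hot(word, vocab):
    # 1 at the word's position in the vocabulary, 0 everywhere else.
    return [1 if w == word else 0 for w in vocab]

vocab = ["dog", "bird", "fish", "cat", "cow"]  # "cat" sits at the 4th position
print(one_hot("cat", vocab))  # [0, 0, 0, 1, 0]
```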
So this is just one of the ways of representing the word "cat": if the dictionary has five words and the position of "cat" is fourth, we can put a 1 there and 0s at the rest. This is a feature vector. Now let's see a bigger example. Say our dictionary has 100 words and we have a list of documents — by documents I mean messages, as in our SMSSpamCollection dataset — so every row corresponds to one document, that is, one message, and each message has some words. Say we went over the complete dataset and figured out that there are 100 unique words. Then, to represent the first message, we count how many times w1 occurred in this message — say 0 times — how many times w2 occurred — say 2 — how many times w3 occurred — say 0 — and so on, and w100 occurred, say, 1 time; and its label is ham or spam. Here we have just two labels, so we can encode them as 0 and 1 — let's say this one's label is 1. Similarly, for the second message we get 0, 3, 1, …, 1 with label 0, and for the third 4, 0, 3, …, 0 with label 1. This is just one way of representing it, and this technique is called count vectorization, because here the cell values actually represent the count of each word in a particular message. This is just one of the ways of vectorizing — there are other methods also, and we will see them in later videos — and this matrix is called a document-term matrix. So now you understand the concept of vectorization: we can represent the first message by this vector of 100 dimensions — these are 100-dimensional vectors — 0, 2, 0, …, 1, and similarly for the second sentence, the third, and so on. The main types of vectorization are count vectorization, which we already saw; then we have n-grams;
And then we have TF-IDF, which stands for term frequency-inverse document frequency. So we will see all of these in further videos, and in the next video we will see how to use the count vectorization that is included in the scikit-learn package. So see you in the next video. Thank you. 15. Count Vectorization: Now that we know about vectorization, let's start with the simplest of vectorization techniques, which is count vectorization. So count vectorization is a vectorization technique which creates a document-term matrix. In the last video we had seen a matrix like this: if our raw text had some documents, and we cleaned those documents (that is, removing punctuation and stop words, and then tokenizing), we would get lists of words. From those we extract all the unique words and list them, and also count how many times they have occurred in the entire text corpus, and then write the corresponding frequencies in each document; for example, in document one, word one appears two times, word two appears zero times, and so on. So this matrix is called a document-term matrix. So CountVectorizer in scikit-learn learns the word vocabulary and creates a document-term matrix, where the individual cells denote the frequency of that word in a particular document. And in order to use count vectorization, we will import it from sklearn.feature_extraction.text: we will import CountVectorizer, and then we'll create an object of that. It has an optional parameter, analyzer, and we will see examples both without passing it and with passing it. So let's begin in our notebook. This is the usual stuff: the imports we have been using; we're saving the stop words list for the English language; then we're using the Porter stemmer; and then we're reading this SMS dataset into a data frame using pandas' read_csv function, and we're creating two columns, label and message.
So let's go ahead and print the first 5 rows of the data frame. So these are the labels, and this is the message. Then we define a function to clean the text. This is also the usual stuff that we have been doing: here we're removing the punctuation, and we're splitting it into tokens; the splitting is done on non-word characters. Then we're doing stemming on all those words and also removing the stop words. So this is our cleaning function; we will use this later. First, let's look at a sample corpus. This is our sample corpus, not the SMS spam collection dataset; we will look into that also. So, in order to understand CountVectorizer, here we imported it, and then we created it using all the default parameters, and this is our corpus: just these three sentences. When we do count_vect.fit, this CountVectorizer will learn a vocabulary dictionary of all the tokens in the raw documents. So it will learn all the tokens, but it will not yet create the document-term matrix; that will happen when we do transform. So at this stage, let's go ahead and run it. We see that it first prints the vocabulary. The vocabulary is this dictionary: here you can see all the unique words are listed, and by default it will automatically get rid of one-character words, so 'a' is not there in the list, and the rest will be here. It sorts them in alphabetical order: 'document' has an index of one, 'another' has an index of zero, then 'here' has an index of two. So this is a dictionary of each word and its corresponding index. And when we call get_feature_names, we will get this list of tokens. Now let's do transform on this. When we do transform, it transforms the documents into a document-term matrix. So it will now calculate this document-term matrix, where the columns will be the unique tokens and one row will correspond to one document.
So in this case there are three documents, so three rows, and seven unique words, so seven columns, and the individual cell values will denote how many times a particular token has occurred in a particular document. So when we do transform, it creates the document-term matrix. Let's go ahead and run it. We printed the shape, and we can see the shape is three by seven, because there are three documents (that is, three sentences) in this corpus and there are seven unique words. Then finally, when we just print X, it's a sparse matrix. You can see here that most of the cells are zero, and this was a small example where I had set up the sentences to repeat some tokens multiple times. In general, for a really large matrix, most of the cell values will be zero, because a corpus of documents will contain several thousands or even hundreds of thousands of unique words, whereas one sentence or one document will not be that big, so it's bound to contain zero in most of the places. So when we do print(X), it just prints the cells which are non-zero; internally it's a sparse matrix, and that's what it's printing. So it's not printed in the matrix form. In order to see the data in the matrix form, or if we want to create a data frame out of it, we can import pandas, and then we can use pd.DataFrame and pass X.toarray(), and for columns we can set the feature names; then we can print the data frame. Let's run it. We get the error 'DataFrame constructor not properly called', so here we are missing something; let's fix it, and now we get the data frame. So now we have got a feel of what CountVectorizer is, and we're ready to use it on our sample dataset.
That is the SMS spam collection. So let's do count vectorization on the SMS spam collection. Here we will create a new CountVectorizer, and here we will use our custom text-cleaning function: in the analyzer we will pass this clean_text function, so it will use our clean_text instead of using the default analyzer. Now let's do it. And earlier we did it in two steps: we first called fit and then transform. We can combine these into one step with fit_transform. So let's go ahead and use fit_transform, and then we will call it on this msg column, so it will take the message column and also apply the cleaning function that we're passing. And now let's print the shape of this. Uh oh, 'clean_text not defined'; I think I am not running the cell that defines it, we need to run it once. Yes, now let's run it again. We see that it has 5572 rows; our dataset has 5572 documents, that is, 5572 messages, and then we have 8713 unique tokens after cleaning. And if we print count_vect.get_feature_names(), it will give us a list. In our sample data, when we printed the feature names, it just printed this list of seven tokens, but here we have 8713 tokens to be printed, so we need to be careful; this is a big list, and you can see there are even some numbers, and then the word tokens and so on. So we cannot easily work on this; let's create a smaller sample dataset instead. We will call it data_sample, and we will take the first ten rows of this original data frame. Then we redefine the vectorizer, and let's copy this, and let's run it. So here we have just ten rows, because we have selected just ten rows of our original data, and it has 131 unique tokens. So we can go ahead and print the data frame; let's create the data frame with pd.DataFrame, and we will call X.toarray(), and we will set the columns.
Here, for the columns, we use count_vect.get_feature_names(), so it will set the feature names as the column headers, and yes, let's print the head of this data frame. So this is our data frame; you can see it contains 131 columns, which correspond to the unique tokens, and it has ten rows. So in the first document, or first message, this word occurs one time, this word occurs one time, and so on. So let's take a look at this message. We see that in the first message we do have that word, with the given frequency, which is correctly reflected there. So this is the document-term matrix, and we have created it using CountVectorizer. In the next video we will see n-gram vectorization, and then after that we will see TF-IDF. So thanks for watching; see you in the next video. 16. N-Grams Vectorization NLP: In the previous video, we saw count vectorization, which is one of the most simple ways of doing vectorization. In this video we will study how we can do vectorization using n-grams. So let's begin. N-grams also creates a document-term matrix, like we had seen in the case of count vectorization, where documents are the rows (that is, individual sentences) and columns represent the unique words present in the entire corpus of those documents. Here the cells still represent the count, similar to count vectorization, but instead of the columns being the individual unique words that are present in the corpus, here the columns will represent all the combinations of adjacent words of length n. So, for example, say this is our given document or sentence, 'I am studying NLP', and let's say we are using bigrams. For a bigram, n will be equal to two, because it's a bigram, so the columns, the items we have to pick, are adjacent words, two at a time. So 'I am' is one bigram, and then 'am studying' is another, and then 'studying NLP' will be another one, and these are all unique among themselves. So these are the three columns when we are looking at bigrams.
Now, in the same way, for trigrams n will be three, so we will be looking at three adjacent words at a time, and all such combinations need to be unique. So one column would be 'I am studying' and the second would be 'am studying NLP'. Similarly, for 4-grams, we will take four adjacent words at a time, but this sentence has just four words, so it will be just 'I am studying NLP', and that would be it. So this is the concept of n-grams. Let's begin in the notebook. It will be very similar to what we had seen in the case of count vectorization, and here also we will be using CountVectorizer only, but there will be some minor differences. So this is the exact same notebook I had copied from the previous video. So this reading-the-raw-text part will be the same. Let's run it. So this is the raw text, and we have labelled the labels as the label column, which will contain either ham or spam, and the second column, named message, will contain the actual messages. And here we have to modify the cleaning. So let's define the cleaning function. Here there will be a minor change in the cleaning function. In the previous video we were passing this cleaning function as an analyzer to this CountVectorizer, but here it needs the complete sentence, because depending on what n-gram we're selecting, depending on this n, it will be picking adjacent words; for example, for trigrams it will pick all those unique trigrams from a given sentence. So traditionally we removed the punctuation, then we combined it, then we split it into words, and then we got rid of stop words; but now, after getting rid of the stop words, we will join all of these back with spaces between the words, using ' '.join. So that's the only change that we made in the function. Now here we will explicitly apply the cleaning function to the second column and then pass the clean text to our CountVectorizer.
So here let's create a new column, call it msg_clean, and then we will do the cleaning on this second column, which is the message, and let's print the first 5 rows of this. So we see that now this new column is also a sentence, instead of a list of tokens, and it's a clean version of the second column where the punctuation is removed. Like here, you can see this comma is gone; similarly this triple dot is gone here, and similarly here. Stop words are also removed. So this is the clean text, but it is a sentence now, and now we will be passing this clean sentence to the CountVectorizer. So this was the sample corpus; let's try this also. Here we will need to pass one additional parameter, and that will be ngram_range. If we don't pass anything (let's first run it), then the default is (1, 1), so it will be columns of single words only. So let's try other combinations. If we pass ngram_range=(1, 2), then it will look for all unigrams and all bigrams; with (1, 3) it will look for all unigrams, bigrams and trigrams. We have already seen the unigram case in the last video, which was the count vectorization case, the default case, and now let's try bigrams. So here again, let's this time try fit_transform, so we don't need to call fit and transform separately, and let's run it. So you see that it collected all the bigrams: 'another sentence', 'document is', 'is another' and so on. These are all bigrams, and its shape is three rows, corresponding to these three documents, and there are eight unique bigrams corresponding to the adjacent word pairs. So this is how it will be. Similarly, it will collect all other combinations of adjacent words for other n-gram ranges, and the columns will represent those unique n-grams.
And then, if we pass (2, 3), it will collect all the bigrams and trigrams both. So we see there will be bigrams and trigrams, but we don't have single words, because we're starting from two. The same thing we can do on our sample data: we will just look for bigrams, and here we will call it on msg_clean, and we don't need to run the corpus cell; let's straightaway run on the sample data. Let's run it: it has 10 rows and 126 columns, and then let's convert it to a data frame and print it. So we see that all the bigrams are there; each column is two words now, as you can see. Let's verify with the first sentence: you can see here that a bigram from it is counted. Similarly, the other bigrams would also be represented, but since we have 126 columns, we cannot see all the bigrams at once. So this is how it's different from count vectorization, and in the next video we'll see another vectorization technique called TF-IDF. So we'll see you in the next video. Thank you. 17. TF-IDF Vectorization: In the previous two videos, we have seen count vectorization and n-gram vectorization. In this video you'll see another vectorization technique called TF-IDF. It also creates a document-term matrix, similar to CountVectorizer and n-grams, and here the columns are individual unique words, same as CountVectorizer; in n-grams, we saw that the columns were all the combinations of adjacent words, and the size of those combinations depended on what n-gram we were using. So here it will be unique words, same as CountVectorizer, but the cells, instead of containing the frequency of each word in a particular document, will now contain a different weight, which will be calculated using this formula. And this weight signifies how important a word is for an individual text message or document. And we will see why.
How is this weight signifying that? So it depends on two things. The formula is w(i, j) = tf(i, j) x log(N / df(i)). Here we see that tf(i, j) is the number of times the term i occurs in a document j, divided by the total number of terms in j. So if we have a sentence which contains five words and a word occurred two times in that sentence, then tf will be two divided by five. And then we have that multiplied by log(N / df(i)), where N is the total number of documents in that corpus and df(i) is the number of documents which contain the term i. So let's see: what if a term occurs very frequently in a given document? Then this numerator will be the number of times i occurs in that document j, so it will be high, divided by the total number of terms in j. So if it's occurring very frequently in that document, as compared to other words in that document, this will be high; it is the frequency of the word i in the given document. And this df(i) is in the denominator, so the smaller it is, the larger this complete term will be; and this term denotes the number of documents containing i. So suppose this word is not very frequent in other documents: if the corpus has, say, a thousand sentences, and we're looking at a particular sentence where the term i occurs frequently, but we see that it's not very frequent elsewhere, that means that word is a rarely used one, that is, not commonly used; that's why it's not present in other documents. So df(i) will be smaller, and the overall term will be larger. So if the word appears many times in a given sentence, but it's a rare word overall, then this complete term will be very large, and w(i, j) in that case denotes how important the word i is for differentiating that document j when compared to other documents. So let's look at an example. Suppose a given document in a given corpus is this: 'I am studying NLP'. First we need to calculate tf for a term; let's say we are calculating it for the term 'am'.
So here I use this term 'am'. It occurs just one time here, so tf will be one divided by the total number of words, 1, 2, 3, 4; so it's 0.25. And N is 200: we are given that this corpus contains 200 documents, and df of 'am' is 2. df denotes the number of documents containing 'am', so this means that out of those 200 sentences or documents, it's present in only two sentences; it's quite a rare word for this corpus. So now let's calculate w for 'am', which will be the value in our document-term matrix: it's 0.25 multiplied by log(N / df), and N is 200 and df is 2, so this will be 200 divided by 2, which will be 100, and log of 100 will be 2, so it will be 0.5. So now we're ready to write the code for this. It's really very similar to our CountVectorizer code. The first part is reading raw text, which will be the same as for CountVectorizer, so let's run it. And here we are dividing the data frame into two columns: one is the message, and the second is the label containing ham or spam. Then the next part, cleaning, will also be the same as for CountVectorizer. In the n-grams video we had modified this text cleaning slightly: after removing stop words, we were joining the list containing those clean tokens back into sentences separated by spaces, because n-grams expects sentences so that it can look at all the n adjacent words in the documents. Here we don't need to do that. The columns are exactly the same as for CountVectorizer; only the cell values represent a different weight, whereas in CountVectorizer the cells represented the frequency of a given word in a particular sentence. So instead of CountVectorizer, let's change it to TfidfVectorizer; it is also present in sklearn.feature_extraction.text, but it will be TfidfVectorizer here. And let's create an instance of that and call it tfidf_vect.
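The hand calculation above can be checked with a few lines of Python. Note the lecture uses a base-10 log here, while scikit-learn's TfidfVectorizer uses a smoothed natural log, so the library's values will not match this exactly:

```python
import math

def tfidf_weight(term_count, doc_len, n_docs, docs_with_term):
    tf = term_count / doc_len                  # term frequency in the document
    idf = math.log10(n_docs / docs_with_term)  # base-10 log, as in the lecture
    return tf * idf

# 'am' occurs once in the 4-word document, and in 2 of the 200 documents
w = tfidf_weight(term_count=1, doc_len=4, n_docs=200, docs_with_term=2)
print(w)  # 0.5
```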
And this is our dummy corpus, which contains three documents. So we're fitting the TfidfVectorizer on this, so it will create a vocabulary, but it will still not calculate the TF-IDF values; when we do transform, it will do that and create the document-term matrix. And this is a sparse matrix, because most of the terms are zero, so by default it will be a sparse matrix. Then, when we want to convert it to a data frame, we call toarray() on it, which will convert it to a full-size matrix, and then we can create a data frame out of that and print it. So let's run it. We see that the columns are exactly the same as for CountVectorizer, and the individual values are the same as what we had calculated in our example. And this is the data frame which contains the TF-IDF values for each word in a given document. So now let's do the same thing for our sample dataset. Let's skip this part, which runs on the complete corpus and would be too big; for our case, we will be picking the first ten messages from this SMS spam collection dataset. And let's name the vectorizer tfidf_vect2; it will again be a TfidfVectorizer, and we will pass here our clean_text function as the analyzer. So we have done that; let's first run it, and then we will call transform on this tfidf_vect2. Let's run it. Ah, so it says 'tfidf2 is not defined'; okay, that was a typo. So now it has 10 rows, because we had selected ten rows of the entire dataset, and it has 131 columns. Let's print the data frame. So this is the document-term matrix, and here we can see the columns are individual words, and in the first sentence these terms are non-zero: this one occurs and its weight is 0.25, and this one is also 0.25. It would be really tedious to manually verify how many times each of these words has occurred in the other documents.
But let's assume it's doing the right thing and it's giving the correct document-term matrix. So we saw three types of vectorizers: first count vectorization, then we saw n-grams, and then we saw TF-IDF. All are very similar to each other. TF-IDF is similar to count vectorization; the only thing is that instead of directly representing the frequency of a word in a given document, it also takes into account how frequently this word occurs in the other documents, so it will relatively filter out common words and highlight the words that are more important to a given document; this weight will signify that. And in n-grams, instead of picking single words, we were picking all the unique adjacent n words in the documents. So this will end our vectorization part, and in the next video we will start on feature engineering, where we will try to create some more relevant features that are not directly reflected here; here we are taking the words themselves as features, but there may be other useful features which are not present, and we will try to create those. So see you in the next video. Thank you. 18. Feature Engineering: Introduction: Now that we have vectorized our data into features, we're ready to train our machine learning algorithms. But before that, let's look into feature engineering. So why is it important? So far we have just created all of the features that the data directly has given us: in our case, we have taken all the unique words present in the corpus and created a document-term matrix out of that, and we also saw three techniques of vectorizing. But something that makes the algorithm learn even better is feature engineering, where we create new features which are not directly visible to us. Here we will use our domain knowledge of the data that we have and make some new features out of it. So here I am emphasizing two things: creating new features and transforming existing features.
So, for creating new features: for example, one important factor that may be helpful in classifying a text message as spam or ham may be the length of the document. It may be possible that longer text messages are more likely to be spam than shorter text messages, so that can be one such feature. These are not directly given to us by the data; we are using our knowledge. Then there can be another feature, like: what is the average word length in each text message or document? Is there some relation between spam and ham on that? Then, maybe the spam messages use too much punctuation; so if we look at what percentage of characters in a given text message is punctuation, that may be a useful feature for determining whether the text message is spam or ham. Another could be how frequently the words within the document are capitalized: are they properly capitalized or not? That can also be an indicator. And then there can be many more such features; it all depends on how well you understand the data. Then we can also transform the data and likely make our learning algorithm work better. One such common transformation is a power transformation: for a value x, we take x squared, or the square root of x, or maybe x cubed, or some other power of x. These can be our transformations. Then we can standardize our data. For example, we may have some data on a skewed scale, say from 0 to 20, and we see that the data is too skewed, with a very long tail and most of the data concentrated at one end; so we can do a log transform on this and bring it to a scale where it's spread out more uniformly, maybe 0 to 1.1 or so. That looks much better than our original data: we have changed the scale of the data, and it looks much better suited. And then we can do normalization. By this I mean: say we have 100 features, but they are on different scales; one ranges from 0 to 1, whereas another can be in the 1,000 to 1,000,000 range.
So sometimes bringing that data to some common range, maybe 0 to 100 or 0 to 1, those kinds of things, helps, and then our algorithm is likely to learn better than on the non-normalized data. So this is just a brief introduction to feature engineering; it's a very big field. In the next video, we'll see some examples of these feature creations in our code. And then, in a further video, we will do the feature evaluation: that is, we will see whether our features are able to give some distinction between the spam and ham messages or not. We will do some plotting and then see how well they're able to distinguish them. So see you in the next video. Thank you. 19. Feature Engineering: Feature Creation: In the last video, we saw a basic introduction to feature engineering, and I said that feature creation and transformations are very important for machine learning algorithms to work better. So in this video, we will look into feature creation, and we will see it in the code. We saw that some examples of feature creation could be message length, punctuation usage, how frequently capitalized words are used in the text messages, then the average word length, and there can be many more. So in this video we will try to add code for the first two, message length and punctuation usage; you can try other features as well and see whether they're related to the classification or not. So let's begin. The first thing is to read the text, and here we will do feature engineering before we do the cleaning, because in cleaning we remove all the punctuation and stop words, and the message length would itself be changed by removing those punctuation marks, stop words and spaces. So we will be computing these features before we do any cleaning. So first we will read the data; let's run it. So this is our usual notebook setup.
The new feature will be message length. Let's create a new column in our data frame; let's call it msg_len, which will store the length of the message in each individual row. And then we can apply a lambda function on the second column, the message column: it will return len(x), and we will subtract the number of whitespace characters, because these don't contribute much to the length. So we're storing the length of these messages in this column. And then let's see the first 5 rows of this new data frame. So these are the numbers of characters present in each message. Now let's create another feature, and we will see what the percentage of punctuation is in a given document. For punctuation we need to import string, and we can define a function count_punct, and it will take the raw text. For every character in the text, we iterate through the characters, and if it's a punctuation character, we produce a one; finally we will add all those ones to get the total count: count = sum([1 for c in txt if c in string.punctuation]). You may remember that in removing the punctuation we were doing just the opposite: we were keeping the character for each character in the text if that character was not in string.punctuation. Here we are counting: for every character which is punctuation, we will be producing a one, and then finally we will sum up all of the ones. And then we will return the percentage: we divide the count by the length of the text (minus the whitespace) and then multiply by 100 to get the percentage. And then let's create a new column in our data frame, and we will call it punct%, and for each of these documents we will pass it to our function count_punct. So we will get a new list with the percentage of punctuation in each of those documents, and let's see the first five rows of the new data frame. So this is the percentage.
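The two features can be sketched together like this; the two sample messages are made up for illustration:

```python
import string
import pandas as pd

# made-up stand-ins for the SMS messages
data = pd.DataFrame({"msg": ["Free entry!! Win a prize now!!",
                             "Ok, see you at home later."]})

# message length, excluding whitespace characters
data["msg_len"] = data["msg"].apply(lambda x: len(x) - x.count(" "))

def count_punct(txt):
    # count punctuation characters and express them as a percentage
    # of the non-whitespace length
    count = sum(1 for c in txt if c in string.punctuation)
    return round(count / (len(txt) - txt.count(" ")) * 100, 3)

data["punct%"] = data["msg"].apply(count_punct)
print(data)
```

For the first message, 4 of the 25 non-space characters are punctuation, so punct% comes out as 16.0.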
So our first document has 20% punctuation, the second has 3%, the third is 12%, and so on. And let's round it; we can round it to two or three decimal places to look more appropriate. So here in this video we saw how we can add new features over and above our vectorized features list; then we will do the training on these. In the next video, we'll see how useful these two features that we have just created are, and we will also do plots of those to see whether there is any clear distinction in the data or not, depending on these newly created features. That we will call feature evaluation. So see you in the next video. Thank you. 20. Feature Engineering: Feature Evaluation: In the last video, we created two new features, which were message length and the amount of punctuation (the punctuation percentage) used in the individual documents. So in this video we will evaluate how good those features are. We will plot the message length and the punctuation percentage for both spam and ham messages and see if we can see a clear difference in the distributions. So let's begin in our notebook. For plotting, we will need to import pyplot from matplotlib, and then we will import numpy as np. And this line, %matplotlib inline, will tell it to plot directly inline in the notebook. So let's begin. First we will use pyplot.hist, and we will plot first for spam and then for ham. So for spam we filter with label equal to spam, and then we will pick this msg_len column, and then we will need to set the bins. Let's first define them: we will use numpy's linspace. We're assuming that the length is not more than 500, and we will create the bins accordingly, and let's plot it. Let's do the same for ham: first we filter the ham rows, and then we will see how the plot looks. So let's run it. We should add some legend, some labels, so that we can differentiate between these two.
Most of the messages were ham, so I guess this tall one is for ham. Let's plot it. So we add the labels, and the same here. Let's also normalize the two plots: we know that most of the messages were ham, so plotting the raw counts on the same graph would not be good. To normalize them, we will set density equal to True, and let's see how the plot looks now. Now we have some normalization, but you see that some data lies beyond 400, while most of it is below 200 or 250. So let's reduce the axis range so that we get more space. Now we see that the two plots are quite separate: mostly the blue bars are above 100, and these orange histogram bars extend up to 200, but they're mostly concentrated below 100. So we can see that some distinction is already coming up. Let's add the legend, and let's locate it at the upper right. So the spams are the blue ones and the orange ones are ham. Now, from this we conclude that the spam messages have length more than 100, or are larger in length, and the hams are of smaller length. So this seems to be a pretty useful feature, which the model may not have learned by itself; providing this will help the model in differentiating messages between ham and spam. Now let's do the same for punctuation. There it was message length; let's copy that code here and change the column to the punctuation percentage. Let's set the maximum punctuation percentage to 100 to start with, and the column name was punct%; the rest will remain the same, so let's run it. Mostly it's below 40, so let's change the range to 40 and run again, and then even to 30. So we see that there is not too much difference: the spam messages are more concentrated, and they look to be between two and maybe seven or eight, so mostly between two and seven, whereas the ham is slightly more distributed, and it extends up to 20.
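The normalized, overlaid histograms can be sketched like this. The small data frame is a made-up stand-in for the SMS data, and the Agg backend line is an assumption only needed so the script runs headless:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")           # off-screen rendering; omit in a notebook
import matplotlib.pyplot as plt

# made-up stand-in for the SMS data: label plus message length
data = pd.DataFrame({"label":   ["spam", "ham", "ham", "spam", "ham"],
                     "msg_len": [140, 35, 52, 155, 41]})

bins = np.linspace(0, 200, 40)  # 40 bin edges from 0 to 200

# density=True normalizes each histogram, since ham far outnumbers spam
plt.hist(data[data["label"] == "spam"]["msg_len"], bins,
         density=True, alpha=0.5, label="spam")
plt.hist(data[data["label"] == "ham"]["msg_len"], bins,
         density=True, alpha=0.5, label="ham")
plt.legend(loc="upper right")
plt.savefig("msg_len_hist.png")
```

The alpha value makes the overlapping bars translucent, so both distributions stay visible on one axis.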
But there is no clear separation in how much they differ, so punctuation percentage may not be such a good feature. Message length, though, seems to be a very good feature, and we can use it when training the model. That's all for this video; see you in the next one.

21. Feature Engineering: Transformations: In this video we will see how to transform our data if it is not properly distributed. By not properly distributed, I mean that the data is somewhat skewed: it may be skewed towards the right with a long right tail, or the other way, with a long left tail. Our aim will be to make it more uniformly distributed, or closer to a normal distribution, with the skewness reduced. By transformation we mean changing each data point in a certain column to make the column's distribution closer to a normal distribution. We will apply power transformations for this. One common family is Tukey's transformation, which raises x to the power of lambda: we define some lambda, and the transform is x^lambda if lambda is greater than zero, minus x^lambda if lambda is less than zero, and log x if lambda is zero. Zero is a special case, because if we simply substituted lambda equal to zero, x^0 would be 1 for every data point. Similarly, another popular transformation is the Box-Cox transform: (x^lambda - 1) / lambda for nonzero lambda, and log x when lambda is zero. In fact, this log case can be derived from the general formula, since we can rewrite x^lambda as e raised to the power log of x^lambda (e raised to the natural log of a quantity gives back the quantity).
To see why e^(ln k) = k, take the log of both sides of k = e^(ln k): both sides become ln k. So we can write x^lambda = e^(lambda ln x), where ln is the natural logarithm. Now expand e to a power as a series: e^y = 1 + y + y^2/2! + y^3/3! and so on. Since we are interested in lambda close to zero, we can safely ignore the higher-order terms, so x^lambda is approximately 1 + lambda ln x. Substituting into the Box-Cox formula, (x^lambda - 1) / lambda is approximately (1 + lambda ln x - 1) / lambda: the ones cancel, the lambda in the numerator cancels with the denominator, and we are left with ln x, which is the log value.

Now, what is the process of applying transformations? First we decide the range of exponents we want to test; it is usual to do it in the range of minus five to plus five. We will use our own custom range: we will not implement one of these formulas exactly, but use a simpler power transformation. We apply the transformation to each value of the chosen feature, and then plot the histogram to compare which transformation looks more uniformly distributed, like a normal distribution.

So let's go back to our notebook. This was the state of the notebook at the end of the feature-evaluation video. Now we will just plot our two new features. Let's copy the cell and run it. We define the bins similarly to earlier, and we will plot the message length first; here we don't need to compare spam and ham separately, we will plot all the data, and the title will be the message length. Running it, we see the distribution for message length. It is not really skewed like we saw in our example; it is more of a bimodal kind of distribution.
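As a minimal sketch, the two transformations and the lambda-to-zero limit just derived can be written like this (the function names are mine, not from the lesson):

```python
import numpy as np

def tukey(x, lam):
    """Tukey's ladder of powers: x**lam (negated for lam < 0), log(x) at lam == 0."""
    if lam == 0:
        return np.log(x)
    if lam < 0:
        return -np.power(x, lam)
    return np.power(x, lam)

def box_cox(x, lam):
    """Box-Cox transform: (x**lam - 1) / lam, with log(x) as the lam -> 0 limit."""
    if lam == 0:
        return np.log(x)
    return (np.power(x, lam) - 1) / lam

# The Taylor argument above says box_cox(x, lam) approaches log(x) as lam -> 0
x = 10.0
for lam in (0.5, 0.1, 0.001):
    print(lam, box_cox(x, lam))
print("log:", np.log(x))  # the printed values converge towards this
```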
It is spread somewhat uniformly, so it is not a perfect candidate for that transformation; let's try our other feature. Earlier the range was between 0 and 250; in this case, the punctuation percentage is much smaller, under 30, so we will change the range. Let's put 50, use 50 bins, and choose the punctuation-percentage column; let's also try 40. Here we can see that the data is skewed, with a long right tail: most of the messages have a punctuation percentage of around 4 or 5 percent. So it is a perfect candidate for this kind of transformation. Let's define some powers; we will start with the square root. If we take square roots, most of the distribution will end up around 2 to 2.5, because the bulk is close to 4 and 5 and the square root of that is about 2; a cube root or fourth root will pull it in even further. So let's see it in the running code and then evaluate it. Let's try some powers, 1 to 4, that is, the 1st, 2nd, 3rd and 4th roots. We first plot the untransformed data, then the square root, cube root and fourth root, and see which gives a good result, and we will print which power each plot used. So here is the original with power 1; then the square root, which looks somewhat better, less skewed. As expected, most of the values were around 4 to 5, so their square roots land between 2 and 2.5, and you can see that here. Similarly, the cube root will be around 1.7, since the cube root of 4 to 5 is less than 2 but more than 1, and the fourth root is even closer to 1.5. This fourth root looks much better; let's try one more.
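Instead of judging the four plots purely by eye, one can also compare the candidate powers numerically via skewness; the data below is a hypothetical right-skewed stand-in for the punctuation percentages:

```python
import numpy as np

# Hypothetical right-skewed punctuation percentages (bulk around 2-8, long right tail)
punct_pct = np.array([2.1, 3.0, 4.5, 5.2, 6.8, 8.0, 12.5, 20.0, 30.0])

def skewness(values):
    """Sample (Fisher) skewness: 0 for symmetric data, positive for a right tail."""
    mu, sigma = values.mean(), values.std()
    return np.mean((values - mu) ** 3) / sigma ** 3

# Try x**(1/i) for i = 1..5; the skew should shrink as the root gets higher
for i in range(1, 6):
    transformed = np.power(punct_pct, 1 / i)
    print(f"power 1/{i}: skewness = {skewness(transformed):.3f}")
```

A skewness near zero indicates a roughly symmetric, more normal-looking distribution, which matches the visual impression that the higher roots flatten the right tail best.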
There is not too much difference between the 1/4 and 1/5 powers, so we can pick the 1/4 power; it looks good. Note that a good number of messages have zero punctuation percentage, and zero will remain zero under whatever power transformation we apply. So this is how we can convert skewed data into a more normal-looking distribution, and it will help our machine learning algorithm learn better, instead of deviating and focusing too much on the skewed, long-tail part of the distribution. That's all for feature engineering; next we will start training our machine learning algorithm, so see you in the next video.

22. Evaluation Metrics: Accuracy, Precision and Recall: Now that we have done all the hard work, we are ready to build our model, which will predict whether a given message is spam or not. In this case our model is a binary classifier, and we need some evaluation metrics to judge whether it is doing well or not, as per the requirements of the business. So let's look at some of the evaluation metrics. First is accuracy. For example, say there are ten messages in our test set and our model correctly predicted the label of eight of them. Then the accuracy is 8/10, or 0.8; we would say it is 80 percent accurate. Precision is very close to this, and at first it may seem similar, but precision measures something different. Let's say our model predicted that seven messages were spam, but of those seven, only six were actually spam and one was wrongly classified as spam. Then six of the predictions it made were correct: it predicted seven as spam, and six were truly spam. So the precision is 6/7: the numerator is the true positives, the correctly predicted spams (six), and the denominator is the total it predicted as spam (seven). Both numbers come from the predictions. Recall has the same numerator.
The model predicted seven messages as spam, and of those seven, six were correct and one was wrong, so the numerator is again six. But the denominator for recall is the total number of messages that are actually spam. Let's say there are actually nine spam messages in the test set; then the denominator is nine, and recall is 6/9. We can see this on a diagram: the rectangular region contains all the predictions, and the circular region contains the messages that were predicted as spam. Here positive means spam, since our aim is to find the spams. So the circular region denotes all the messages the model said were spam, and the region outside the circle is all the messages the model said were not spam. Within the circle, some are correctly labelled and some are wrongly labelled: the correct ones, in green on the left side of the circle, are the true positives, and the rest are the false positives. Similarly, among the negative predictions, where the model said the message is not spam, some are correctly labelled negative and some are incorrectly labelled: the red ones are the false negatives, and the rest are the true negatives. It is worthwhile to visualize what precision and recall mean in this picture. The true positives are six, denoted by the green part of the circle, because the model got six right out of the seven in the circle; the rectangle denotes ten, since there are ten predictions in total; the circle denotes the seven predicted spams, with the six on the left side correctly predicted; the false positive is the one message wrongly classified as spam that really was not; and outside the circle, three messages are actually spam. Precision is then six, the green part of the circle, divided by the whole circle: of all the messages the model said were spam, the fraction that was actually correct. This is what precision denotes.
So that is precision. Now let's see what recall is in this picture. The numerator is the same in both cases; only the denominator changes. For recall, the denominator is what is actually spam: the part of the rectangle that should have been labelled spam, that is, the left part of the rectangle. That is recall.

Now let's work through an example. Say we have six messages to classify, numbered 1 to 6, and the ground-truth labels are spam, ham, spam, ham, spam, spam. A third line denotes what the model predicted. Comparing them, message 1 is correct, message 2 is wrong, message 3 is wrong, and messages 4, 5 and 6 are correct, so it got four out of six correct, and the accuracy is 4/6, about 0.6667. Now let's see the precision. Precision is, out of the number predicted as spam, how many are actually spam. In the ground truth there are four spams and two hams; the model predicted four messages as spam, and of those four, three are correct, with only one ham wrongly classified as spam. So the model predicted four as spam, got three of them right, and its precision is 3/4, or 0.75. Now let's see the recall. The numerator remains the same, three: of the four the model predicted as spam, three were correct. The denominator denotes the actual number of spam messages, and in the ground truth four messages are spam, so the denominator is four.
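The six-message example can be checked in a few lines; the exact correct/wrong pattern below is one labeling consistent with the counts in the lesson (4 actual spams, 4 predicted spams, 3 true positives, 4 of 6 correct overall):

```python
# One labeling consistent with the lesson's counts
y_true = ['spam', 'ham', 'spam', 'ham', 'spam', 'spam']
y_pred = ['spam', 'spam', 'ham', 'ham', 'spam', 'spam']

pairs = list(zip(y_true, y_pred))
accuracy = sum(t == p for t, p in pairs) / len(pairs)      # 4 of 6 correct
true_positives = sum(t == p == 'spam' for t, p in pairs)   # 3 correctly flagged spams
precision = true_positives / y_pred.count('spam')          # denominator from predictions
recall = true_positives / y_true.count('spam')             # denominator from ground truth

print(round(accuracy, 4), precision, recall)  # -> 0.6667 0.75 0.75
```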
So in the ground truth, the total number of spam messages is four. Although recall and precision come out the same in this case, these two fours come from different places: the four in precision comes from the predictions, counting how many the model predicted as spam, while the four in recall comes from the ground truth, counting how many actually are spam. Here they happen to be equal, but in general they differ, so these are two different and strong evaluation metrics for our model, and you need to adjust for precision or recall as per the business requirement. For example, say we have an email spam classifier. It is not too problematic if you get a few spam messages in your inbox; you can manually delete them, and that is not a big problem. The big problem is when one of your non-spam messages is classified as spam and put into your spam folder, because that may be an important message, or it may be costly to you. In this case the precision has to be increased, because whatever is classified as spam had better really be spam: the model should only predict something as spam when it is quite sure. So suppose there are four spams and the model only identifies one of them, and that one is correct. Then the precision is one, but the recall will be low. In this business requirement that is fine: you allow some false negatives, but you do not allow false positives, because a false positive means a normal email landing in the spam folder. In this case we would like to improve the precision. Similarly, in other cases where false negatives are costlier, you would like to improve the recall. So depending on the business requirement, you adjust for one of these evaluation metrics.

23. K-Fold Cross-Validation: We will continue our discussion on model evaluation, and we will see a few more tools for it. First, let's see what the holdout test set is. It is a test set sampled from the dataset and kept aside for evaluation purposes; it is not used in training. So if you have, let's say, a dataset of 10,000 examples,
you set aside 1,000 of them for evaluating your model and train on the other 9,000. The 1,000 is called the holdout test set, and it is used for evaluation, not for model fitting. It is used to evaluate the ability of the model to generalize to unseen data: the model was trained on the 9,000, so it has not seen this data. The ultimate goal of your model is to learn something and to apply that learning to new data; that is why we keep this 1,000 aside and do not train on the entire data, so we get an idea of how well the model does on new data.

Now let's see a technique for model evaluation called k-fold cross-validation; we will be using it in our further videos for evaluating our model. The main concept here is that the full dataset is divided into k subsets. If k is five, we divide our dataset into five equal subsets, and we apply the holdout method on each of them. We saw that the holdout test set is the part of the dataset that is used for evaluation and not for training. Now we apply cross-validation: we divide the data into k subsets, and each will act as the holdout for one iteration. In the first iteration, one subset is the holdout and the other subsets are used in training; next, another subset is the holdout and the remaining subsets are used in training; and so on. This is repeated k times, so each of the k subsets acts as the holdout test set once. This gives a more robust estimate of the model, because it is quite likely that with a single 9k/1k split the model performed unusually well on that particular 1k; maybe if you had set aside a different 1k and included this one in training, you would have got different evaluation metrics. So by trying different subsets, you get a more robust estimate of your model, and you don't need to implement this manually;
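Scikit-learn handles this splitting internally, but a minimal hand-rolled version makes the mechanics clear (the helper below is my own sketch, not the library's code):

```python
import numpy as np

def kfold_indices(n_samples, k):
    """A minimal sketch of what k-fold splitting does internally: split the row
    indices into k consecutive folds and let each fold be the holdout once."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# 10 rows, 5 folds: every iteration trains on 8 rows and holds out 2
for i, (train_idx, test_idx) in enumerate(kfold_indices(10, 5), start=1):
    print(f"iteration {i}: train on {len(train_idx)} rows, hold out {test_idx.tolist()}")
```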
it is handled internally by scikit-learn. But it is always good to know how things work, so let's take an example of five-fold cross-validation. We saw that in cross-validation we divide the data into that many subsets. Say we have 10,000 examples and k is five; we split the 10,000 into five equally sized chunks, subsets 1 to 5. Now we repeat the holdout process five times. In the first iteration, the first subset is the test set and the others are the training set, so we train on subsets 2 to 5 and evaluate on the first subset; the subsets are split only once and do not change between iterations. Say in the first iteration we get an accuracy of 0.915 (accuracy is just one metric for evaluation; you can use other metrics as well), which means the model is 91.5 percent correct. In the second iteration the second subset is the test set, in the third the third, and so on. So every subset acts as the test set once and as part of the training set four times. Say the remaining iterations give accuracies of around 0.88, 0.93, 0.90 and 0.91. If you take the average, it will be something around 0.90, close to 90 percent, and that is a much more realistic estimate. Notice the maximum is 0.93 and the minimum is 0.88, so there is a 5 percent difference here, and 5 percent is a sizable difference: depending on which subset you pick as the holdout test set, you can see a difference of 5 percent. So if you take the average of all five, you get around 90 percent, and that is a much more realistic evaluation of your model. 24.
Random Forest - Introduction: Let's study another popular machine learning algorithm called random forest. Random forests are a type of machine learning algorithm that falls under the broader category of ensemble learning. So let's see what ensemble learning is. It is a technique that creates multiple models; we call these weak models. We create multiple weak models, each gives some prediction, and we combine them to produce a bigger prediction; the combined model is the stronger or better model, and the smaller models are the weak models. So we combine weak models to create a strong model, and we take into account the aggregated opinion of many models to arrive at a final opinion: each model gives some opinion, and the final result is taken by simple voting, so whatever decision is in the majority is the one we take. The relationship of random forest to ensemble learning is that ensemble learning is the broader category, and random forest is one specific method under it: random forest is an ensemble learning method that creates a collection of decision trees, as shown in this diagram, and then aggregates the predictions of those trees to make the final prediction. A simple example: say we have 100 decision-tree models, the weak models, and we are classifying a document; 70 of them predict spam and 30 predict ham. The majority is spam, since 70 is more than 30, so the final prediction is spam, by simple voting. Now let's see some of the advantages of random forest. One is that it can be used for classification as well as regression, though here we will use it for classification. Another is that it is robust to outliers, because outliers may affect some of the individual models,
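The 70-versus-30 voting example can be written in a couple of lines (the vote counts are the hypothetical ones from the example):

```python
from collections import Counter

# Hypothetical votes from 100 weak models: 70 trees say spam, 30 say ham
tree_votes = ['spam'] * 70 + ['ham'] * 30

# The forest's final prediction is simply the majority class
prediction = Counter(tree_votes).most_common(1)[0][0]
print(prediction)  # -> spam
```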
but ultimately we are taking the decision of the majority, so we get rid of the effect of outliers and missing values as well. It is also robust to overfitting, and it outputs feature importances, telling us which features are really useful and which are not.

25. Random Forest - Building a basic model: Now that we know what a random forest is, let's build a basic random forest model. This is the same code as in the earlier videos, reading the SMS spam collection dataset into a data frame; let's run it. This is the original format of the data: we have the label, either ham or spam, and the message. Then we went ahead and added two new features, message length and punctuation percentage, and then we defined a function to clean the data, where we got rid of punctuation, then got rid of stop words, and then we vectorized the cleaned data, using the TF-IDF vectorizer in this case. Now, here, what we do is build the full feature matrix. Earlier the input was the message and the label, with the label being either ham or spam; now the input is the message length, the punctuation percentage, and the vectorized message. After vectorization, the columns are all the words contained in the dataset's vocabulary. Say we had only three words a, b and c (keeping it simple, although in reality we have a lot more); each document contains only some of these words. So our feature matrix looks like this: we take the data frame excluding the label, and append the message length, which can be 5, 6, 10 and so on, and the punctuation percentage, which may be 2 percent, 7 percent, 10 percent and so on. So X, the input for the classifier, is the entire data frame containing the vectorized representation plus the two added features, and y is the label column. Now let's continue.
So here we concatenate the message length and punctuation percentage with the vectorized data, and let's run it to see its format. You see columns 0, 1, 2, 3, up to the vocabulary size; these are the different words and whether each is contained in the document or not, followed by the message length and punctuation percentage. This makes up our input X. Now let's build the random forest classifier and run cross-validation on it. We need to import RandomForestClassifier from sklearn.ensemble, and then KFold and cross_val_score from sklearn.model_selection. Then we define the KFold object, which facilitates the k-fold cross-validation, and we need to specify how many subsets to break the dataset into; as in our earlier example with five, we will keep it at five here. Then we define our random forest. Here we will set just one parameter, n_jobs (you can skip this as well); it lets the different decision trees be built in parallel, so it will run faster. Now we run the cross-validation with cross_val_score, which will return the scores. Here we need to pass the model that we are using, so we pass the random forest model; then it needs to know the input features, in our case X; and then the labels, which is the label column (that is why we had removed it from the input features list, because we pass it separately). It also needs to know how to do the cross-validation, that is, which of the rows lie in which of the subsets, and that is done by passing the KFold object, so it knows which data belongs to which of the five subsets.
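Putting the pieces together, the cross-validation cell described above looks roughly like this; since the real X_features and labels come from the earlier lessons, synthetic stand-ins are used here to keep the sketch self-contained:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-ins for X_features and the label column from the notebook
rng = np.random.RandomState(42)
X = rng.rand(100, 20)
y = rng.choice(['ham', 'spam'], size=100)

rf = RandomForestClassifier(n_jobs=-1)   # n_jobs=-1: build the trees in parallel
k_fold = KFold(n_splits=5)               # 5 subsets -> 5 holdout iterations
scores = cross_val_score(rf, X, y, cv=k_fold, scoring='accuracy', n_jobs=-1)

print(scores)          # one accuracy score per fold
print(scores.mean())   # the averaged, more robust estimate
```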
We pass the KFold object there, and again we set n_jobs equal to minus one. Now let's run it. It will take some time, as it runs the cross-validation over all five subsets. In the first iteration, in which one of the five subsets was treated as the holdout test set, the score was about 0.97; in the second iteration it is again 0.97; then 0.95; then 0.96. So this is how a basic random forest classifier model is built and how you perform cross-validation.

26. Random Forest with holdout test: Now that we have learned to build and evaluate a basic random forest model using cross-validation, let's learn to create a holdout test set and explore the results in a bit more detail. In this exercise we will be using a few parameters and attributes of our random forest classifier, so let's see what those parameters and attributes mean. The first parameter is n_estimators, whose default value is ten; it denotes the number of decision trees in the forest. Next is max_depth, the depth of the individual decision trees. By default it is None, so there is no constraint on the depth and each tree grows as deep as required to minimize the loss criterion. Next is n_jobs, the number of jobs to run in parallel for both fit and predict. The default is None, and None means one, that is, no jobs run in parallel; in our previous video we used minus one, which means use all processors. And we have one very important attribute called feature_importances_, which returns the importance of each feature; a higher number denotes that the feature is more important. So let's see our notebook. This is the same code that we saw in our previous video: we read the data, then we added the features message length and punctuation percentage.
Then we cleaned the data, vectorized it using the TF-IDF vectorizer, and concatenated the message length and punctuation percentage with the vectorized words, which together act as the input feature matrix X. Now let's explore a random forest with a holdout test set. First we need to import precision_recall_fscore_support from sklearn.metrics, and we will import it as score so that we don't need to write the complete long name each time. Next we import train_test_split from sklearn.model_selection. Now we split our data into training and test sets using train_test_split, and the order of its outputs is important: first X_train, then X_test, then y_train, then y_test. We pass the input features, then the labels, and then we specify what percentage we want to keep aside for testing; we set test_size=0.1, keeping 10 percent of the data for the test set. Then we import RandomForestClassifier, store it in rf, and, as we saw in the slide, set n_estimators, which defaults to ten, to 100, that is, 100 decision trees; max_depth, instead of None, to some limited depth of 15; and n_jobs to minus one so that all processors are used. Then we create and fit the model: we call rf.fit and pass the X_train and y_train that we got from train_test_split, which contain 90 percent of the data, and store the fitted model in rf_model. Finally, let's run it. Oh, there is some typo; after fixing it, we have the fitted model rf_model. Now we will use it and look at the feature_importances_ attribute to see what the most important features are.
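The split-and-fit steps just described can be sketched as follows, again with synthetic stand-ins in place of the real feature matrix and labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and labels
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = rng.choice(['ham', 'spam'], size=200)

# Keep 10% of the rows aside as the holdout test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

rf = RandomForestClassifier(n_estimators=100, max_depth=15, n_jobs=-1)
rf_model = rf.fit(X_train, y_train)  # fit on the 90% training portion

print(len(X_train), len(X_test))  # -> 180 20
```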
You see a list of importance scores, but it is not sorted (and by default sorting would be ascending), so on its own it is not very informative. Let's also take the columns of X_train and combine them with the feature importances using zip, and print the first five features. You see that the first feature is the message length, so message length is the most important feature, and the others are word indexes, or column indexes. In this matrix we don't have the raw words directly, as each word was converted to a column number; so, say, word number 246 is the second most important word. This is a very handy attribute, and it can be used to list the most important features in our model. Next we will predict on the test set. We call predict on the model that we fitted and pass X_test, the 10 percent of the data stored at the test split; it returns an array, and we store it in y_pred, so y_pred stores the predicted values, ham or spam. Next we use the precision_recall_fscore_support that we imported under the name score; it returns four different kinds of scores. We call it passing y_test, the actual labels, and y_pred, the predicted labels, and we need to specify what the positive label is, because the precision and recall values change when we change the positive label. In our case the aim is to predict the spam messages, so here the positive label is spam. Next we pass average='binary'. The default is None, which means the scores of each class are returned; here we set it to 'binary' because we only want the results for the spam label, so whatever the positive label is, it returns the results for that class only. The labels must be binary for this, and in our case they are, because each message is either ham or spam. So it returns the four scores. We also need the accuracy. It is calculated by comparing the predicted values and the actual test values, which returns 0 or 1 for each message; we sum all those values and then divide by the number of labels in the test set.
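A self-contained sketch of the evaluation steps just described, with synthetic data and hypothetical column names standing in for the real ones:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the column names are hypothetical placeholders
rng = np.random.RandomState(1)
X = rng.rand(200, 5)
y = rng.choice(['ham', 'spam'], size=200)
columns = ['body_len', 'punct_pct', 'word_0', 'word_1', 'word_2']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
rf_model = RandomForestClassifier(n_estimators=100, max_depth=15, n_jobs=-1).fit(X_train, y_train)

# Pair each importance with its column name and show the largest ones first
top5 = sorted(zip(rf_model.feature_importances_, columns), reverse=True)[:5]
print(top5)

y_pred = rf_model.predict(X_test)
# pos_label='spam', average='binary': report the scores for the spam class only
precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')
accuracy = (y_pred == y_test).sum() / len(y_pred)
print(f'Precision: {precision:.3f} / Recall: {recall:.3f} / Accuracy: {accuracy:.3f}')
```

Note that with random stand-in data the printed scores are meaningless; the point is the shape of the calls, not the numbers.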
So whatever is the porch to level, it will return the results. Fourth in that class on Lee and here the levels would be by angry. They must be blindly. Oh, and in our case it's mind because it's either him or spam. So it will return the four schools. No, we will need the accuracy. Also, the Corsican recalculated are comparing the predictive value and actually destroy values so it will return Jor one. So we will some, all those values and then divide medo number off. Levinson Just certain. So this would return that Chrissy No, let's bring all the different schools it's and let's run it. So you see that precision is one. That means oh, whatever Project says was to live in their did you spend is in fact, a spam. So the number or just in this scenario, it's very important that the male classify for spam class for doesn't put any good emailing to Spam folder because that can lead to some serious or damage. But if Mrs some of the spam and it lands him to rain books, it's fine. So they're precision equal to one means that or tour. It's a It's a spam it's actually spent. So this is 100% and a Chrissy's 94%. That means off whatever the there are 100 emails, and it predicted spam or ham, and 94% of that was correct. And recall his 0.41 That means off all the spams. It was ableto identify 44 41% of them, so remaining roughly 59 percent of the stance landed in between books. So this is not good. We need to improve the call, but the precision is they really were