Natural Language Processing 1 : Zero to Hero w/ TensorFlow | Aditya Shankarnarayan | Skillshare

Natural Language Processing 1 : Zero to Hero w/ TensorFlow

Aditya Shankarnarayan, That Indian Coder Guy

14 Lessons (41m)
    • 1. Welcome

    • 2. Neural Network

    • 3. Google Colab

    • 4. Tokenization

    • 5. Coding: Tokenization

    • 6. Vocabulary

    • 7. OOV

    • 8. Coding: OOV

    • 9. Padding

    • 10. Coding: Padding

    • 11. Dataset

    • 12. Coding: Dataset

    • 13. Class Project

    • 14. Wrap Up!






About This Class

Learn NLP the right way!

Easy lessons with detailed examples.

Lessons are explained thoroughly.

TensorFlow will be explained with great examples.

Part 2:

Topics covered in this class will set you up in the right direction.

Basic Python knowledge is required.

Meet Your Teacher


Aditya Shankarnarayan

That Indian Coder Guy


Hello, I’m Aditya
I am a programmer who is passionate about helping students become better coders.
I have been programming since I was in the 8th grade and have been teaching for over a year now. I primarily focus on Data Science and related topics.
I hope you find value in my classes and learn a lot from them.






1. Welcome: Welcome to natural language processing, or NLP. In this class, we'll use TensorFlow to process text for natural language processing. Unlike images, which come in regular shapes as grids of pixel intensity values, text is a lot messier. Texts can be long or short, with different punctuation, and then there is the question of whether to use words or letters for text processing. In this class, we'll tackle all these questions head on. Across this series, we'll learn about tokenization, word embeddings, and recurrent neural networks, or RNNs, and their types, such as LSTMs. Then we'll look at a number of sequence models. By the end, we'll learn how to build a model that can write like Shakespeare. Natural language processing is a huge topic and has many different aspects to it, such as tokenization, word embeddings, RNNs, et cetera, so I have decided to divide it into smaller classes so that it is easily digestible. In this class, we'll deal with tokenization. In simple words, tokenization is converting words or text into something that a model or a neural network can understand. As we know, neural networks deal with numbers: weights are numbers, biases are numbers. So we need a sensible way of converting text into something that a neural network or a machine can understand. We'll take a look at how to convert words into numbers so that a machine can understand these words as well as differentiate between them. We'll also take a look at how to pad a sentence if it is shorter or longer than the others. My next two lessons are from a previous class about image classification with deep learning; they deal with aspects that we will require for the further lessons. One deals with neural networks, and the other deals with the platform that we're going to code on, which is called Google Colab. If you are familiar with these aspects, you are free to skip ahead past these two lessons.
So I hope to see you take my class, and I look forward to seeing you in the next lesson. 2. Neural Network: Before we start, I would like to familiarize you with certain aspects of a neural network. A neural network is made up of layers: an input layer, hidden layers, and an output layer. The input layer is where input is given to the network; if you're making an image classifier, we will input the pixel values of the images. The hidden layers are where all the heavy lifting happens. Each hidden layer has weights and biases associated with it, and the values of the weights and biases help the neural network to learn different things. How learning takes place is out of the scope of this class, as it requires a mathematical and calculus background. Luckily, all this math is implemented for us in the form of TensorFlow functions, making it extremely easy. Of course, having the background will certainly help you build a better neural network, but it is not required for beginners. To get a basic understanding of neural networks, I recommend you go and check out the TensorFlow Playground (playground.tensorflow.org). This is an excellent tool to understand how things work in a network. It shows a classification problem where the network will try to separate the respective colored dots. Try out different combinations and test how they affect the learning. The learning rate is how fast or slow the network will learn: a value that is very high will cause problems reaching a reasonable solution, and a value that is too low will slow down learning. An activation is applied to each layer in the network except the input layer. The activation basically shifts the value range of the particular neuron. For example, if you choose ReLU, it shifts the negative values to 0 and keeps the positive values. And if we change it to sigmoid, sigmoid will squash all the values into the range between 0 and 1. Because it maps values between 0 and 1, we use it in binary classification.
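As a companion to the playground demo above, here is a minimal sketch (not taken from the class) of the same ideas in Keras: an input layer, one hidden layer with weights, biases and a ReLU activation, and a sigmoid output for binary classification. The layer sizes here are arbitrary choices for illustration.

```python
import numpy as np
import tensorflow as tf

# Input layer (4 features) -> hidden Dense layer (ReLU) -> sigmoid output.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),               # input layer: 4 values per example
    tf.keras.layers.Dense(8, activation='relu'),     # hidden layer: weights + biases, ReLU
    tf.keras.layers.Dense(1, activation='sigmoid'),  # output squashed into (0, 1)
])

out = model.predict(np.random.rand(3, 4))  # 3 example inputs
print(out.shape)  # (3, 1), one probability-like value per example
```

Because the final activation is sigmoid, every output lands strictly between 0 and 1, which is why it suits binary classification.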
Regularization helps us overcome overfitting and underfitting, which we will cover in a later lesson. So I recommend you check out how the different settings work and test out different combinations. 3. Google Colab: Google Colab is a free cloud service where you can improve your Python programming skills and develop deep learning applications using popular libraries such as TensorFlow and Keras. Unlike coding on your local machine and downloading a bunch of libraries, all these libraries come pre-installed, saving up lots of space on your local machine. As machine learning or deep learning is a computationally heavy task, we can use the GPU provided to train our models faster. The only requirements are an Internet connection and a Google account. When you open up Colab, you will be greeted with a welcome page. You can read it and learn more about it, or simply skip it and click on New Notebook. This notebook will be saved to your Google Drive. You can save it to your local machine by going to the File menu and saving it as a Python notebook or as a Python file. To change to dark mode, go to Tools and then Settings; under Settings you can find Theme, change it to dark, and then hit Save. To execute code, you write the code, then press Shift+Enter or hit the play button; this will execute the code, and the output will be displayed below the cell. I want to share more information about Colab, which I will do in a later lesson. So this was a basic introduction to Google Colab, which will be sufficient for our use case; to get in-depth knowledge about it, go to the TensorFlow YouTube page to find more information. 4. Tokenization: Welcome to this new lesson. In this lesson, we'll deal with tokenization. As discussed in a previous lesson, tokenization is all about converting text into something that a neural network can read. We can take many approaches to doing so. What do you think these approaches are?
The first approach is converting the letters into their respective character ASCII values. As you can see, I have two words, silent and listen. They basically mean the opposite of each other, but the letters are the same, just rearranged in a different manner, so they both have the same ASCII values. It therefore becomes a daunting task to train a neural network that can understand the difference between these two words. So we'll take an approach of using words for tokenization: we encode the words that we see. So we encode the word 'I' as 1, 'love' as 2, 'my' as 3 and 'dog' as 4. Now we have four words and four encodings. So what if we have a new sentence, 'I love my cat'? As we already have encodings for the three words 'I', 'love' and 'my', we don't need to encode these words again; we only need to encode the new word 'cat', and we give it an encoding of 5. This process is called tokenization. So here's the code for tokenization. As you can see, we use TensorFlow and Keras for our tokenization and machine learning purposes. TensorFlow and Keras have powerful functions that we can use to our advantage to tokenize these words very easily. As you can see, we use the Tokenizer function to tokenize these words. It has a parameter called num_words, which has the value 100; this parameter will keep the top 100 words, by frequency, that it sees in the sentences we give it for tokenization. Then we use the next function, called fit_on_texts, and pass it the sentences; it will tokenize all the words it sees in the sentences. word_index will be our dictionary, in which we can see the different tokens that were created for us. It will look something like this: 'i' is a key with 1 as its value, then 'love' 2, 'my' 3, 'dog' 4 and 'cat' 5. As you can see, this is a dictionary which contains the unique words that it sees: 'i', 'love', 'my', 'dog' and 'cat'.
We had two 'I's, two 'my's and two 'love's in our sentences, but each appears only once in the dictionary. As you can see, the 'i' is not capitalized: that is because the tokenizer ignores capitalization and lowercases every word, so that the vocabulary stays small. So in the next lesson, we'll deal with sequences and how we can create a sequence with these tokens. 5. Coding: Tokenization: In this lesson, we'll be looking at how the code works for tokenization. I hope you are comfortable with Google Colab; if you are not, you can check out the earlier lesson, which is completely dedicated to Google Colab and how you can operate it. Now, as you can see, the code remains the same, but there are a few changes in the sentences: there is a comma, and there is an exclamation mark. So what will the tokenizer do with this punctuation? Will it tokenize the punctuation marks as themselves, so that the comma is a token and the exclamation point is a token, or will it tokenize the punctuation together with the words it is attached to? Let's see what the tokenizer will do. As you can see, we do not have any punctuation, exclamation point or comma anywhere: the 'i' does not have a comma, and neither does 'dog' have an exclamation point. Why do you think that is? It's because if we tokenized all the different characters like the exclamation point and the comma, our vocabulary would grow huge; therefore, we do not consider exclamation points or any other punctuation marks when tokenizing text. So this word_index contains the different tokens of the words that were given in the sentences: 'love' has the token 1, 'my' has the token 2, 'i' has the token 3, 'dog' has the token 4, 'cat' 5 and 'you' 6. This order of tokens is based on the number of times each word appears in the sentences. As you can see, the tokenizer only considers the top 100 words as tokens. 'love' appears three times in the sentences, so it gets the token 1; 'my' gets the token 2; and 'i' occurs two times, so it has the value 3.
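The code discussed in this lesson can be sketched like this with the Keras Tokenizer; the three sentences mirror the lesson's example, and the printed word_index shows the punctuation stripped, everything lowercased, and the tokens ordered by word frequency.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!'
]

tokenizer = Tokenizer(num_words=100)  # keep only the 100 most frequent words
tokenizer.fit_on_texts(sentences)     # build the word -> token dictionary

word_index = tokenizer.word_index
print(word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```

Note that the comma and exclamation point never appear in word_index, and 'love', the most frequent word, takes the first token.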
So I have given you an example of how tokenization works and have shown how it treats different punctuation marks. In the next lesson, we'll deal with sentences that contain words the tokenizer has not seen before and does not have tokens for. I hope to see you in the next lesson. 6. Vocabulary: Welcome back. In the previous lesson, we saw how to convert words into tokens and how the tokenizer function in Keras and TensorFlow deals with punctuation marks as well as capitalization. In this lesson, we'll deal with sequences. But before that, I'd like to introduce some terminology: this dictionary is called a vocabulary. A vocabulary is basically a set of tokens that a machine can understand. Sequences means converting the tokens that we already have into meaningful sentences. We use a function called texts_to_sequences, which will convert the sentences into the tokens that we already have; we call texts_to_sequences on the sentences present here, and the rest of the code remains the same. For example, say I have a vocabulary which already has tokens for all these words, and the tokens I have are 'I' as 0, 'love' as 1, 'my' as 2 and 'dog' as 3. When I use this function to convert that sentence into a sequence of tokens, it will give me a list of 0, 1, 2 and 3. Note that we are giving it the set of sentences as a nested list, which basically means a list inside a list. If you are familiar with machine learning, you know that we always test our model on data that is unseen. So now we give the tokenizer new test data: 'I really love my dog' and 'My dog loves my manatee'. As you can see, it did not convert the words in these sentences that have no tokens, such as 'really', 'loves' and 'manatee', because the vocabulary does not have those words. So if we run the texts_to_sequences function on this test data, the result looks like this: I have a nested list which has two smaller lists inside it.
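The behaviour just described, where texts_to_sequences simply skips unseen words, can be sketched with the same example sentences as in the lesson:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

train_sentences = ['I love my dog', 'I love my cat', 'You love my dog!']
test_sentences = ['I really love my dog', 'My dog loves my manatee']

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(train_sentences)
# word_index: {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

test_seq = tokenizer.texts_to_sequences(test_sentences)
print(test_seq)  # [[3, 1, 2, 4], [2, 4, 2]]
```

'really', 'loves' and 'manatee' are silently dropped from the output, which is exactly the problem the next lesson addresses.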
These two smaller lists represent the two sentences, and below is the vocabulary that I built from the sentences shown in the slides before. As you can see, it tokenized these two sentences; you can check the sentences against the vocabulary given below for reference. The function correctly tokenized the words of 'I love my dog' but skipped 'really', and it also skipped 'loves' and 'manatee' in the next sentence. Because our vocabulary is small, and the set of sentences that we trained our vocabulary on was also small, we can easily point out the words that were missing. But if the vocabulary were large, or there were a lot of sentences, it would become really difficult for us to directly pinpoint the words that are missing from the vocabulary. We'll see how to tackle this problem in the next lesson. 7. OOV: Welcome to this new lesson. In the previous lesson, we saw how the tokenizer function deals with words that it hasn't seen before: when it faces words it does not have tokens for, it skips them. This makes it difficult for a programmer to tell whether the tokenizer has skipped a word. So we can use something called an OOV token. This is a token that is assigned to words that are not present in the vocabulary, and this code does exactly that. I pass a new parameter to the Tokenizer constructor, called oov_token, and I give it the tag 'OOV'. So whenever it encounters a word in the test data that is out of its vocabulary, it will give it that 'OOV' tag. When I give it the same test sentences that I gave it earlier, it tokenizes them and creates sequences of tokens. The test sentences are the same as before, and the tokenizer still does not have tokens for the words 'really', 'loves' and 'manatee', but now those words are replaced with the OOV token.
So whenever it encounters a word that is outside of its vocabulary, it will give it the token 'OOV', and then the programmer can correctly identify that it has encountered an out-of-vocabulary word. So the sentence 'I really love my dog', which wasn't tokenized correctly in the previous slide, is now tokenized to a certain extent. As you can see, 'I love my dog' is tokenized correctly, whereas 'really', which wasn't present in our vocabulary, has been given the token 1. If you look at our vocabulary, 1 is the value for the key-value pair of 'OOV'. So 'really' has been given the OOV token: whenever the function encounters a word that is out of its vocabulary, it will assign it that specific OOV token. For the next sentence, 'My dog loves my manatee', as you can see, it correctly tokenized 'my' and 'dog', whereas it gave the OOV token to 'loves' and 'manatee'. I hope you understand this concept, and I'll see you in the next lesson. 8. Coding: OOV: In this lesson, we'll see how the code looks. Most of the code has remained the same, but as you can see, we have added a new parameter to the Tokenizer constructor, called oov_token, and we have given it the value 'OOV' as a tag. So whenever it encounters a word that is not present in its vocabulary, it will give it that token. We'll create our vocabulary using the sentences above. After creating the vocabulary, you can see that in the first position we have a new tag called 'OOV'. Below that, you can see we have the sequences for the sentences that we used to create the vocabulary; they do not contain a single OOV tag, because we used these exact sentences to create the vocabulary. So now we'll move to the test sentences and see how the function applies there.
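The oov_token behaviour from this lesson can be sketched as follows; as an assumption I use '<OOV>' as the tag here, but any string that cannot collide with a real word works.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

train_sentences = ['I love my dog', 'I love my cat', 'You love my dog!']
test_sentences = ['I really love my dog', 'My dog loves my manatee']

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')  # out-of-vocabulary tag
tokenizer.fit_on_texts(train_sentences)

print(tokenizer.word_index['<OOV>'])  # 1 -- the OOV tag takes the first slot
test_seq = tokenizer.texts_to_sequences(test_sentences)
print(test_seq)  # [[4, 1, 2, 3, 5], [3, 5, 1, 3, 1]]
```

Instead of being dropped, every unknown word ('really', 'loves', 'manatee') now shows up as token 1, so the sequences keep their original length and the gaps are visible.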
As you can see, the vocabulary does not have tokens for 'really', 'loves' and 'manatee'. So when we run the texts_to_sequences function on these sentences, we can see that 1 is the token used for 'really', 'loves' and 'manatee'. When it was parsing the first sentence, it came across 'I', which it had a token for, so it assigned that token; then it came across 'really', and as 'really' was not present in the vocabulary, it gave it the OOV, or out-of-vocabulary, token. Next came 'love', which got its specific token, and then the same for 'my' and 'dog'. For the next sentence, it assigned 'my' and 'dog' their respective tokens; for 'loves' it did not have a token, so it gave it the OOV token; then it applied the respective token for 'my', and applied the OOV token for 'manatee'. So this is how the OOV token works. I hope you understand this concept, and I'll see you in the next lesson. 9. Padding: Welcome to this new lesson. In the previous lesson, we saw how the tokenizer function approaches out-of-vocabulary words: it assigns them the OOV token. In this lesson, we'll learn about padding. If you've taken my previous class about image classification, you will already know about padding: we use padding in image classification to add a border around the image, so that when we perform convolutions, we do not lose important information from the image. Padding in NLP is similar to that in image classification. In NLP, we use padding so that all our sentences are of the same size. We can do it with this code right here. As you can see, I have created sequences of tokens using the texts_to_sequences function and stored them in the sequences variable, as we have done in the previous lessons. After doing that, I have passed the result into a new function called pad_sequences. After passing it to this new function, all the sentences in the sequences will have the same length: it will add zeros to all the sentences present in the sequences variable.
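The padding step just described can be sketched like this; the sentences are illustrative, and padding='post' and maxlen below are examples of the extra parameters that the coding lesson mentions.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(sequences)  # default: zeros are added in front ('pre')
print(padded.shape)                # (3, 7) -- every row padded to the longest length

post = pad_sequences(sequences, padding='post', maxlen=5)  # zeros after, cap at 5
print(post.shape)                  # (3, 5)
```

The longest sentence has 7 words, so by default every sequence becomes length 7, with the shorter ones front-filled with zeros.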
So every single sentence present in the sequences will have the same length. In the next lesson, we'll discuss how the code works and how we can pass more parameters to this function to change the padded sequences to our liking. 10. Coding: Padding: Welcome to this new lesson. In this lesson, we'll discuss pad_sequences. In the previous lesson, I told you how pad_sequences works conceptually, and now we'll talk about how it works in code. We'll also discuss a few more parameters that we can pass to the pad_sequences function to change its behavior. The first step is to import pad_sequences from tensorflow.keras.preprocessing.sequence. The next few lines stay the same. Then, as you can see, there's a change here: we call the pad_sequences function and pass it the sequences (the sequences are just the tokenized versions of the sentences), along with a few extra parameters that we'll talk about later. Then we print out the padded sequences. 11. Dataset: So far, we have been looking at text and how to tokenize it and turn sentences into sequences using the tools available in TensorFlow. We did that using very simple hard-coded sentences, but when it comes to real-world problems, you'll be using a lot more data than just a few simple sentences. So in this lesson, we'll take a look at a public dataset and how you can process it to get it ready to train a neural network. We'll start with this one published by Rishabh Misra, with details on Kaggle at this link. This is a really fun public domain dataset all around sarcasm detection; it is very straightforward and simple, not to mention very easy to work with. It has three elements in it. The first is is_sarcastic, which is 1 when the record is considered sarcastic and 0 otherwise. The second is the headline, which is just plain text.
And the third is the link to the article that the headline describes. The code will look something like this. The headline field contains the actual sentences that we are going to use; is_sarcastic gives the labels that are required to train our model; and the URL is the article link given in the dataset. The code for doing all of this will look something like this. First, we import the json module. Then we open the JSON file with 'r', which means read, and we store, or load, the JSON file into a variable called datastore. Then we have three empty lists, sentences, labels and urls, which correspond to the different aspects present in the dataset. Then we have a for loop, which will loop over all the items in the datastore and append the corresponding items to the corresponding lists: the headlines will go to the sentences list, is_sarcastic, being our labels, will be appended to the labels list, and the article links will be appended to the urls list. So this is the code that we'll use to process the dataset; the rest is the same code that we have been seeing for the past few lessons. In the next lesson, we'll see how the JSON file looks and how you can bring the JSON file into your Google Colab environment. You can also find the JSON file linked in the description below. I hope to see you in the next lesson. 12. Coding: Dataset: Welcome to the final technical lesson of this class. In the previous lesson, we discussed the JSON file, and this is a small snippet of that JSON file. As you can see, it has three fields in it: the article link, the headline, and the is_sarcastic label. If you are familiar with machine learning, you know that whenever we are training a model, we require labels for it; that is, before testing our model we need to tell it that this is one thing and that is something else.
So if you are training an image classifier to test whether an image is a cat or a dog, you train it by showing it an image of a cat and telling it this is a cat, then showing it an image of a dog and telling it this is a dog, before actually using it and testing it on unseen data. To use the JSON file in your Google Colab environment, you just need to download the JSON file to your computer and then drag and drop it into the Google Colab environment under the Files section; after a few seconds, it will appear inside your environment. But make sure you remember that after you reset or shut down the environment, the JSON file will disappear, so you'll need to import it once again. Now we'll start with the code. The first thing is to import the json module, which can be done using this code, import json. This module will help us read JSON files and convert them to our liking. Then we open the JSON file using the open function, mentioning the path of the JSON file; this depends on where you placed the JSON file in your environment. The 'r' indicates that we're reading the JSON file. Then we store, or load, the JSON file into a variable called datastore. Then we create the empty lists. These lists represent the different aspects of the dataset: sentences, labels and urls. In these three empty lists, we'll store the different items that are present in the JSON file, that is, the headline, the is_sarcastic label and the article link. Now we create a for loop over the items in the datastore, that is, the JSON file, and we append each aspect of the item to its respective list: we append the headline to sentences, is_sarcastic to labels, and the article link to urls. So we divide the JSON file into its respective attributes, the sentences, the labels and the urls. We'll only be using the sentences in this class, as that is what we require to create the tokens for our application; then the code remains quite similar.
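The steps above can be sketched end to end. To keep the example self-contained, a small inline snippet stands in for the downloaded sarcasm.json; the two records and the example.com links are made up for illustration, but the three field names match the dataset, and with the real file you would use json.load on the opened file instead.

```python
import json

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Stand-in for sarcasm.json: a list of records with the same three fields.
snippet = '''[
  {"article_link": "https://example.com/a", "headline": "boring local man does boring thing", "is_sarcastic": 1},
  {"article_link": "https://example.com/b", "headline": "city council approves new budget", "is_sarcastic": 0}
]'''

datastore = json.loads(snippet)  # with the real file: datastore = json.load(open('sarcasm.json', 'r'))

sentences, labels, urls = [], [], []
for item in datastore:                   # loop over every record in the dataset
    sentences.append(item['headline'])   # the text we will tokenize
    labels.append(item['is_sarcastic'])  # 1 = sarcastic, 0 = not
    urls.append(item['article_link'])

tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
padded = pad_sequences(tokenizer.texts_to_sequences(sentences))

print(len(tokenizer.word_index))  # vocabulary size
print(padded.shape)               # (number of headlines, longest headline length)
```

With the full 25,000-plus headline dataset, the same code yields the much larger vocabulary and padded shape discussed in the next part of the lesson.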
The only difference here is that we print out the first element in the padded sequences. As I told you before, padded, like any other set of sequences, is a nested list, that is, a list inside a list; therefore, we will be printing the first list present inside the list. We'll also print out the shape of the padded list. So let's see how this code runs. After running this code, we first get the length of the word index, that is, the size of the word index. The word index is our vocabulary, and we have a word index of about 30,000 words, which is far greater than the word index that we were using before. So this is our vocabulary; as you can see, the order is defined by how many times each word occurs in the sentences. The next thing we print out is the first padded list: this is the first sentence inside the padded variable. We called the padding function on the sequences, and it added the extra zeros for us, so that every single element in the padded variable is of the same size. This sequence is what we got after padding. And the last thing we printed was padded.shape: we have about 25,000 different sentences in the padded variable, and each of them has a length of 40. So this is how we use an external public dataset for our application. I hope you enjoyed this lesson; in the next lesson, we'll discuss the class project. This set of code is really important, so I hope you practice it and give it some time before moving on. 13. Class Project: Thank you for sticking around for so long. Now we will discuss the class project. The class project will be very similar to what we have done before. First, you have to download the sarcasm dataset; you can find it in the link below. I have already edited the sarcasm dataset so that it will be easier to import into Python and work on. Then you'll create a vocabulary using the dataset.
Then you have to perform all the different tasks that I have mentioned in the previous lessons, and use the pad_sequences function to pad the sequences. After doing that, you'll tokenize a random sentence or quote that you like and then share it with the rest of the class in the project section. Because everyone is using the same vocabulary, it will be fun to guess the different quotes and sentences that have been used. I hope you enjoyed my class; the next lesson will be the final lesson of this class. Thank you. 14. Wrap Up!: Thank you for sticking around for so long and taking my class. A quick summary of what we covered: we learned about tokenization and what tokenization is. We also learned how to create a vocabulary. Then we created sequences of sentences using our tokens and vocabulary. Then we learned about the OOV token, that is, the out-of-vocabulary token for words that aren't present in our vocabulary. Then we learned about padding, which adds zeros before or after a sentence so that all the sentences in the sequences are of the same length. Finally, we used a public dataset for our application. So this marks the end of our NLP class. If you're interested in machine learning and deep learning, you can take my other class on image classification, where we classified different clothing items, as well as classified between a dog and a cat. It is filled with knowledge, and if you enjoyed this class, I know for a fact that you'll enjoy that one too. Feel free to ask questions in the discussion section below and share your project in the project section. I hope you'll leave a review, as it helps me to create future classes. I thank you all once again for taking my class.